Abstract

In the high-dimensional regression model a response variable is linearly related to p covariates, 
but the sample size n is smaller than p. We assume that only a small subset of covariates is 'active' 
(i.e., the corresponding coefficients are non-zero), and consider the model-selection problem of 
identifying the active covariates. 

A popular approach is to estimate the regression coefficients through the Lasso ($\ell_1$-regularized
least squares). This is known to correctly identify the active set only if the irrelevant covariates
are roughly orthogonal to the relevant ones, as quantified through the so-called 'irrepresentability'
condition. In this paper we study the 'Gauss-Lasso' selector, a simple two-stage method that first
solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set.

We formulate the 'generalized irrepresentability condition' (GIC), an assumption that is substantially
weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers
the active set.
1 Introduction

In linear regression, we wish to estimate an unknown but fixed vector of parameters $\theta_0 \in \mathbb{R}^p$ from $n$
pairs $(Y_1, X_1), (Y_2, X_2), \dots, (Y_n, X_n)$, with vectors $X_i$ taking values in $\mathbb{R}^p$ and response variables $Y_i$
given by

$$Y_i = \langle \theta_0, X_i \rangle + W_i, \qquad W_i \sim N(0, \sigma^2), \tag{1}$$

where $\langle \cdot, \cdot \rangle$ is the standard scalar product.

In matrix form, letting $Y = (Y_1, \dots, Y_n)^T$ and denoting by $X$ the design matrix with rows
$X_1^T, \dots, X_n^T$, we have

$$Y = X\theta_0 + W, \qquad W \sim N(0, \sigma^2 I_{n \times n}). \tag{2}$$

In this paper, we consider the high-dimensional setting in which the number of parameters exceeds
the sample size, i.e., $p > n$, but the number of non-zero entries of $\theta_0$ is smaller than $p$. We denote
by $S \equiv \mathrm{supp}(\theta_0) \subseteq [p]$ the support of $\theta_0$, and let $s_0 \equiv |S|$. We are interested in the 'model selection'
problem, namely in the problem of identifying $S$ from data $Y$, $X$.
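
As a concrete illustration of the observation model (2), the following minimal sketch draws a synthetic instance with a sparse parameter vector. The function name `simulate_sparse_regression`, the i.i.d. standard Gaussian design, and the equal-magnitude random-sign coefficients are assumptions made for this illustration only; they are not prescribed by the model.

```python
import numpy as np

def simulate_sparse_regression(n, p, s0, sigma, signal=1.0, seed=0):
    """Draw (X, Y, theta0, support) from Y = X theta0 + W with W ~ N(0, sigma^2 I).

    Assumptions of this sketch: X has i.i.d. N(0, 1) entries and the s0
    non-zero coefficients all have magnitude `signal` with random signs.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                  # design matrix, rows X_i
    theta0 = np.zeros(p)
    support = np.sort(rng.choice(p, size=s0, replace=False))
    theta0[support] = signal * rng.choice([-1.0, 1.0], size=s0)
    W = sigma * rng.standard_normal(n)               # Gaussian noise
    Y = X @ theta0 + W
    return X, Y, theta0, support
```

The high-dimensional regime discussed here corresponds to calling the function with `p > n` and `s0` much smaller than `p`.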

In words, there exists a 'true' low dimensional linear model that explains the data. We want to 
identify the set S of covariates that are 'active' within this model. This problem has motivated a 
large body of research, because of its relevance to several modern data analysis tasks, ranging from 
signal processing [Don06, CRT06] to genomics [PZB + 10, SK03]. A crucial step forward has been the 
development of model-selection techniques based on convex optimization formulations [Tib96, CD95, 
CT07]. These formulations have led to computationally efficient algorithms that can be applied to
large scale problems. Such developments pose the following theoretical question: For which vectors
$\theta_0$, designs $X$, and noise levels $\sigma$ can the support $S$ be identified, with high probability, through
computationally efficient procedures? The same question can be asked for random designs X and, in
this case, 'high probability' will refer both to the noise realization W, and to the design realization 
X. In the rest of this introduction we shall focus -for the sake of simplicity- on the deterministic 
settings, and refer to Section 3 for a treatment of Gaussian random designs. 

The analysis of computationally efficient methods has largely focused on $\ell_1$-regularized least
squares, a.k.a. the Lasso [Tib96]. The Lasso estimator is defined by

$$\hat\theta^n(Y, X; \lambda) \equiv \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1 \Big\}. \tag{3}$$

In case the right hand side has more than one minimizer, one of them can be selected arbitrarily for
our purposes. We will often omit the arguments Y, X, as they are clear from the context. (A closely 
related method is the so-called Dantzig selector [CT07]: it would be interesting to explore whether 
our results can be generalized to that approach.) 

It was understood early on that, even in the large-sample, low-dimensional limit $n \to \infty$ at $p$
constant, $\mathrm{supp}(\hat\theta^n) \neq S$ unless the columns of $X$ with index in $S$ are roughly orthogonal to the
ones with index outside $S$ [KF00]. This assumption is formalized by the so-called 'irrepresentability
condition', that can be stated in terms of the empirical covariance matrix $\hat\Sigma = (X^T X/n)$. Letting
$\hat\Sigma_{A,B}$ be the submatrix $(\hat\Sigma_{ij})_{i \in A, j \in B}$, irrepresentability requires

$$\|\hat\Sigma_{S^c,S} \hat\Sigma_{S,S}^{-1} \mathrm{sign}(\theta_{0,S})\|_\infty \le 1 - \eta, \tag{4}$$

for some $\eta > 0$ (here $\mathrm{sign}(u)_i = +1, 0, -1$ if, respectively, $u_i > 0$, $= 0$, $< 0$). In an early breakthrough,
Zhao and Yu [ZY06] proved that, if this condition holds with $\eta$ uniformly bounded away from 0,
it guarantees correct model selection also in the high-dimensional regime $p \gg n$. Meinshausen
and Bühlmann [MB06] independently established the same result for random Gaussian designs, with
applications to learning Gaussian graphical models. These papers applied to very sparse models,
requiring in particular $s_0 = O(n^c)$, $c < 1$, and parameter vectors with large coefficients. Namely,
scaling the columns of $X$ such that $\hat\Sigma_{i,i} \le 1$, for $i \in [p]$, they require $\theta_{\min} \equiv \min_{i \in S} |\theta_{0,i}| \ge c\sqrt{s_0/n}$.
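
The irrepresentability condition (4) can be evaluated directly for a given design and parameter vector. The sketch below is an illustration under our own naming (`irrepresentability_slack` is a hypothetical helper, not part of any library): it returns the largest $\eta$ for which (4) holds, so a non-positive value signals that the condition fails.

```python
import numpy as np

def irrepresentability_slack(X, theta0):
    """Return 1 - ||Sigma[S^c,S] Sigma[S,S]^{-1} sign(theta0_S)||_inf, Sigma = X^T X / n.

    Condition (4) holds with parameter eta exactly when the returned value is >= eta.
    """
    n, p = X.shape
    sigma_hat = X.T @ X / n                      # empirical covariance
    S = np.flatnonzero(theta0)                   # true support
    Sc = np.setdiff1d(np.arange(p), S)           # complement of the support
    if S.size == 0 or Sc.size == 0:
        return 1.0                               # nothing to check
    # Solve Sigma[S,S] w = sign(theta0_S) instead of forming the inverse explicitly.
    w = np.linalg.solve(sigma_hat[np.ix_(S, S)], np.sign(theta0[S]))
    return 1.0 - float(np.max(np.abs(sigma_hat[np.ix_(Sc, S)] @ w)))
```

For an orthogonal design such as $X = \sqrt{n} I_{n \times n}$ the off-support block vanishes and the slack equals 1.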

Wainwright [Wai09] strengthened these results considerably by allowing for general scalings of
$s_0$, $p$, $n$ and proving that much smaller non-zero coefficients can be detected. Namely, he proved that
for a broad class of empirical covariances it is only necessary that $\theta_{\min} \ge c\sigma\sqrt{(\log p)/n}$. This scaling
of the minimum non-zero entry is optimal up to constants. Also, for specific classes of random
Gaussian designs (including $X$ with i.i.d. standard Gaussian entries), the analysis of [Wai09] provides
tight bounds on the minimum sample size for correct model selection. Namely, there exist $c_\ell, c_u > 0$
such that the Lasso fails with high probability if $n < c_\ell s_0 \log p$ and succeeds with high probability if
$n \ge c_u s_0 \log p$.

While, thanks to these recent works [ZY06, MB06, Wai09], we understand model selection via the
Lasso reasonably well, it is fundamentally unknown what model-selection performance can be
achieved with general computationally practical methods. Two aspects of the above theory cannot
be improved substantially: (i) The non-zero entries must satisfy the condition $\theta_{\min} \ge c\sigma/\sqrt{n}$ to be
detected with high probability. Even if $n = p$ and the measurement directions $X_i$ are orthogonal,
e.g., $X = \sqrt{n} I_{n \times n}$, one would need $|\theta_{0,i}| \ge c\sigma/\sqrt{n}$ to distinguish the $i$-th entry from noise. For
instance, in [JM13], the present authors prove a general upper bound on the minimax power of
tests for hypotheses $H_{0,i} = \{\theta_{0,i} = 0\}$. Specializing this bound to the case of standard Gaussian
designs, the analysis of [JM13] shows formally that no test can detect $\theta_{0,i} \neq 0$, with a fixed degree of
confidence, unless $|\theta_{0,i}| \ge c\sigma/\sqrt{n}$. (ii) The sample size must satisfy $n \ge s_0$. Indeed, if this is not the
case, for each $\theta_0$ with support of size $|S| = s_0$, there is a one parameter family $\{\theta_0(t) = \theta_0 + t\,u\}_{t \in \mathbb{R}}$
with $\mathrm{supp}(\theta_0(t)) \subseteq S$, $X\theta_0(t) = X\theta_0$ and, for specific values of $t$, the support of $\theta_0(t)$ is strictly
contained in $S$.

On the other hand, there is no fundamental reason to assume the irrepresentability condition (4).
This follows from the requirement that a specific method (the Lasso) succeeds, but it is unclear why
it should be necessary in general. The situation is very different for estimation consistency, e.g., for
characterizing the $\ell_2$ error $\|\hat\theta - \theta_0\|_2$. In that case the restricted isometry property (RIP) [CT05] (or
one of its relaxations [BRT09, vdGB09]) is sufficient and -essentially- necessary.
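Gauss-Lasso selector: Model selector for high dimensional problems
Input: Measurement vector $y$, design model $X$, regularization parameter $\lambda$, support size $s_0$.
Output: Estimated support $\hat S$.
1: Let $T = \mathrm{supp}(\hat\theta^n)$ be the support of Lasso estimator $\hat\theta^n = \hat\theta^n(y, X, \lambda)$ given by

$$\hat\theta^n(Y, X; \lambda) \equiv \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1 \Big\}.$$

2: Construct the estimator $\hat\theta^{GL}$ as follows:

$$\hat\theta^{GL}_T = (X_T^T X_T)^{-1} X_T^T y, \qquad \hat\theta^{GL}_{T^c} = 0.$$

3: Find the $s_0$-th largest entry (in modulus) of $\hat\theta^{GL}$, denoted by $\hat\theta^{GL}_{(s_0)}$, and let

$$\hat S \equiv \{ i \in [p] : |\hat\theta^{GL}_i| \ge |\hat\theta^{GL}_{(s_0)}| \}.$$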
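
The three steps of the Gauss-Lasso selector can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the Lasso is solved here by plain cyclic coordinate descent (our choice of solver; any Lasso solver for objective (3) could be substituted), the names `soft_threshold`, `lasso_cd` and `gauss_lasso` are hypothetical, and entries that are exactly zero are never selected even if fewer than $s_0$ coefficients are non-zero (a convention added for this sketch).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-10):
    """Minimize (1/2n)||y - X theta||_2^2 + lam * ||theta||_1 by coordinate descent."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n            # ||x_j||^2 / n for each column
    residual = np.asarray(y, dtype=float).copy() # y - X theta (theta = 0 initially)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            old = theta[j]
            rho = X[:, j] @ residual / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != old:
                residual -= X[:, j] * (new - old)
                theta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return theta

def gauss_lasso(X, y, lam, s0):
    """Return (S_hat, theta_gl) following steps 1-3 of the Gauss-Lasso selector."""
    p = X.shape[1]
    # Step 1: support T of the Lasso estimator.
    T = np.flatnonzero(lasso_cd(X, y, lam))
    # Step 2: ordinary least squares restricted to T, zero outside T.
    theta_gl = np.zeros(p)
    if T.size > 0:
        theta_gl[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    # Step 3: keep the entries at least as large as the s0-th largest modulus.
    magnitudes = np.abs(theta_gl)
    kth = np.sort(magnitudes)[::-1][min(s0, p) - 1]
    S_hat = np.flatnonzero((magnitudes >= kth) & (magnitudes > 0))
    return S_hat, theta_gl
```

With an orthogonal design the Lasso step reduces to soft thresholding of each coordinate, and the least squares step removes the shrinkage on the selected coordinates.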



In this paper we prove that the Gauss-Lasso selector has nearly optimal model selection properties
under a condition that is strictly weaker than irrepresentability. We call this condition the generalized
irrepresentability condition (GIC). The Gauss-Lasso procedure uses the Lasso estimator to estimate
a first model $T \subseteq \{1, \dots, p\}$. It then constructs a new estimator by ordinary least squares regression
of the data $Y$ onto the model $T$.

We prove that the estimated model is, with high probability, correct (i.e., $\hat S = S$) under conditions
comparable to the ones assumed in [MB06, ZY06, Wai09], while replacing irrepresentability by the
weaker generalized irrepresentability condition. In the case of random Gaussian designs, our analysis
further assumes the restricted eigenvalue property in order to establish a nearly optimal scaling of
the sample size $n$ with the sparsity parameter $s_0$.

In order to build some intuition about the difference between irrepresentability and generalized
irrepresentability, it is convenient to consider the Lasso cost function at 'zero noise':

$$G(\theta; \xi) \equiv \frac{1}{2n} \|X(\theta - \theta_0)\|_2^2 + \xi \|\theta\|_1
= \frac{1}{2} \langle (\theta - \theta_0), \hat\Sigma (\theta - \theta_0) \rangle + \xi \|\theta\|_1.$$

Let $\hat\theta^{ZN}(\xi)$ be the minimizer of $G(\,\cdot\,; \xi)$ and $v \equiv \lim_{\xi \to 0+} \mathrm{sign}(\hat\theta^{ZN}(\xi))$. The limit is well defined by
Lemma 2.2 below. The KKT conditions for $\hat\theta^{ZN}$ imply, for $T \equiv \mathrm{supp}(v)$,

$$\|\hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} v_T\|_\infty \le 1.$$

Since $G(\,\cdot\,; \xi)$ has always at least one minimizer, this condition is always satisfied. Generalized
irrepresentability requires that the above inequality holds with some small slack $\eta > 0$ bounded
away from zero, i.e.,

$$\|\hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} v_T\|_\infty \le 1 - \eta.$$

Notice that this assumption reduces to standard irrepresentability, cf. Eq. (4), if, in addition, we
ask that $v = \mathrm{sign}(\theta_0)$. In other words, earlier work [MB06, ZY06, Wai09] required generalized
irrepresentability plus sign-consistency in zero noise, and established sign consistency in non-zero
noise. In this paper the former condition is shown to be sufficient.

From a different point of view, GIC demands that irrepresentability holds for a superset of the 
true support S. It was indeed argued in the literature that such a relaxation of irrepresentability 
allows one to cover a significantly broader set of cases (see for instance [BvdG11, Section 7.7.6]). However,
it was never clarified why such a superset irrepresentability condition should be significantly more 
general than simple irrepresentability. Further, no precise prescription existed for the superset of the 
true support. 

Our contributions can therefore be summarized as follows: 

1. By tying it to the KKT condition for the zero-noise problem, we justify the expectation that
generalized irrepresentability should hold for a broad class of design matrices.

2. We thus provide a specific formulation of superset irrepresentability, prescribing both the superset
$T$ and the sign vector $v_T$, that is -by itself- significantly more general than simple
irrepresentability.

3. We show that, under GIC, exact support recovery can be guaranteed using the Gauss-Lasso, 
and formulate the appropriate 'minimum coefficient' conditions that guarantee this. 

As a side remark, even when simple irrepresentability holds, our results strengthen somewhat the 
estimates of [Wai09] (see below for details). 

The paper is organized as follows. In the rest of the introduction we illustrate the range of 
applicability of GIC through a simple example and we discuss further related work. We finally 
introduce the basic notations to be used throughout the paper. 

Section 2 treats the case of deterministic designs X, and develops our main results on the basis of 
the GIC. Section 3 extends our analysis to the case of random designs. In this case GIC is required 
to hold for the population covariance, and the analysis is more technical as it requires controlling the
randomness of the design matrix. The proofs of our main results can be found in Sections 5 and 6, 
with several technical steps deferred to the Appendices. 

1.1 An example 

In order to illustrate the range of new cases covered by our results, it is instructive to consider a
simple example. A detailed discussion of this calculation can be found in Appendix B. The example
corresponds to a Gaussian random design, i.e., the rows $X_1^T, \dots, X_n^T$ are i.i.d. realizations of a
$p$-variate normal distribution with mean zero. We write $X_i = (X_{i,1}, X_{i,2}, \dots, X_{i,p})^T$ for the components
of $X_i$. The response variable is linearly related to the first $s_0$ covariates

$$Y_i = \theta_{0,1} X_{i,1} + \theta_{0,2} X_{i,2} + \dots + \theta_{0,s_0} X_{i,s_0} + W_i,$$

where $W_i \sim N(0, \sigma^2)$ and we assume $\theta_{0,i} > 0$ for all $i \le s_0$. In particular $S = \{1, \dots, s_0\}$.

As for the design matrix, the first $p - 1$ covariates are orthogonal at the population level, i.e., $X_{i,j} \sim
N(0, 1)$ are independent for $1 \le j \le p - 1$ (and $1 \le i \le n$). However, the $p$-th covariate is correlated
to the $s_0$ relevant ones:

$$X_{i,p} = a X_{i,1} + a X_{i,2} + \dots + a X_{i,s_0} + b \tilde X_{i,p}.$$

Here $\tilde X_{i,p} \sim N(0, 1)$ is independent from $\{X_{i,1}, \dots, X_{i,p-1}\}$ and represents the orthogonal component
of the $p$-th covariate. We choose the coefficients $a, b \ge 0$ such that $s_0 a^2 + b^2 = 1$, whence $E\{X_{i,p}^2\} = 1$
and hence the $p$-th covariate is normalized as the first $(p - 1)$ ones. In other words, the rows of $X$
are i.i.d. Gaussian $X_i \sim N(0, \Sigma)$ with covariance given by

$$\Sigma_{ij} = \begin{cases} 1 & \text{if } i = j, \\ a & \text{if } i = p,\ j \in S \text{ or } i \in S,\ j = p, \\ 0 & \text{otherwise.} \end{cases}$$



For $a = 0$, this is the standard i.i.d. design and irrepresentability holds. The Lasso correctly
recovers the support $S$ from $n \ge c\, s_0 \log p$ samples, provided $\theta_{\min} \ge c' \sqrt{(\log p)/n}$. It follows from
[Wai09] that this remains true as long as $a \le (1 - \eta)/s_0$ for some $\eta > 0$ bounded away from 0.
However, as soon as $a > 1/s_0$, the Lasso includes the $p$-th covariate in the estimated model, with
high probability (see Appendix B).

As shown in Appendix B, the Gauss-Lasso is successful for a significantly larger set of values
of $a$. Namely, if

$$a \in \Big[0, \frac{1 - \eta}{s_0}\Big] \cup \Big(\frac{1}{s_0}, \frac{1 - \eta}{\sqrt{s_0}}\Big],$$

then it recovers $S$ from $n \ge c\, s_0 \log p$ samples, provided $\theta_{\min} \ge c' \sqrt{(\log p)/n}$. While the interval
$((1 - \eta)/s_0, 1/s_0]$ is not covered by this result, we expect this to be due to the proof technique rather
than to an intrinsic limitation of the Gauss-Lasso selector.
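
The role of the correlation parameter $a$ in this example can be checked numerically at the population level. The sketch below builds the covariance of the example and evaluates the quantity $\|\Sigma_{T^c,T} \Sigma_{T,T}^{-1} v_T\|_\infty$ for two choices of $T$. The helper names are hypothetical, indices are 0-based (so the $p$-th covariate is index `p - 1`), and the enlarged set $T = S \cup \{p\}$ with an all-ones sign vector is an assumption made for illustration, suggested by the observation that the Lasso includes the $p$-th covariate when $a > 1/s_0$.

```python
import numpy as np

def example_covariance(p, s0, a):
    """Covariance of the example: identity plus correlation a between covariate p and S."""
    if s0 * a ** 2 > 1.0:
        raise ValueError("need s0 * a^2 <= 1 so that b^2 = 1 - s0 * a^2 >= 0")
    sigma = np.eye(p)
    sigma[p - 1, :s0] = a
    sigma[:s0, p - 1] = a
    return sigma

def irrepresentability_value(sigma, T, v_T):
    """Return ||Sigma[T^c,T] Sigma[T,T]^{-1} v_T||_inf."""
    T = np.asarray(T)
    Tc = np.setdiff1d(np.arange(sigma.shape[0]), T)
    if Tc.size == 0:
        return 0.0
    w = np.linalg.solve(sigma[np.ix_(T, T)], np.asarray(v_T, dtype=float))
    return float(np.max(np.abs(sigma[np.ix_(Tc, T)] @ w)))

p, s0, a = 10, 4, 0.3                      # here a > 1/s0 = 0.25
sigma = example_covariance(p, s0, a)
S = np.arange(s0)
# Standard irrepresentability: T = S and v = sign(theta_0) = all ones.
standard = irrepresentability_value(sigma, S, np.ones(s0))
# Enlarged set T = S together with the p-th covariate (assumed for illustration).
T = np.append(S, p - 1)
enlarged = irrepresentability_value(sigma, T, np.ones(s0 + 1))
```

In this instance `standard` equals $s_0 a$, which exceeds 1, so condition (4) fails; `enlarged` is zero because the covariates outside $T$ are uncorrelated with those in $T$, whatever sign vector is used.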



1.2 Further related work 

The restricted isometry property [CT05, CT07] (or the related restricted eigenvalue [BRT09] or
compatibility conditions [vdGB09]) has been used to establish guarantees on the estimation and
model selection errors of the Lasso or similar approaches. In particular, Bickel, Ritov and Tsybakov
[BRT09] show that, under such conditions, with high probability,

$$\|\hat\theta - \theta_0\|_2^2 \le C \sigma^2 \frac{s_0 \log p}{n}.$$

The same conditions can be used to prove model-selection guarantees. In particular, Zhou [Zho10]
studies a multi-step thresholding procedure whose first steps coincide with the Gauss-Lasso. While
the main objective of this work is to prove high-dimensional $\ell_2$ consistency with a sparse estimated
model, the author also proves partial model selection guarantees. Namely, the method correctly
recovers a subset of large coefficients $S_L \subseteq S$, provided $|\theta_{0,i}| \ge c\sigma\sqrt{s_0 (\log p)/n}$, for $i \in S_L$. This
means that the coefficients that are guaranteed to be detected must be a factor $\sqrt{s_0}$ larger than what
is required by our results.

Also related to model selection is the recent line of work on hypothesis testing in high-dimensional
regression [ZZ11, Büh12]. These papers propose methods for testing hypotheses of the form $H_{0,i} =
\{\theta_{0,i} = 0\}$. In order to achieve a given significance level, they require -again- large coefficients,
namely $|\theta_{0,i}| \ge c\sigma\sqrt{s_0 (\log p)/n}$ (see [JM13] for a discussion of this point). In [JM13], we investigate
a hypothesis testing method that achieves any given significance level $\alpha$ for $|\theta_{0,i}| \ge c\sigma/\sqrt{n}$, with
$c$ a constant that depends on $\alpha$. Although the testing procedure can be used in general settings,
the guarantee on its statistical power is provided only for some random Gaussian designs in an
asymptotic sense. A very recent paper by van de Geer, Bühlmann and Ritov [vdGBR13] proposes
a similar procedure and gives conditions under which the procedure achieves the semiparametric
efficiency bound. Their analysis allows for general Gaussian and sub-Gaussian designs. However, it
requires a sample size $n \ge C (s_0 \log p)^2$, namely the square of the optimal sample size.

Let us finally mention that an alternative approach to establishing model-selection guarantees
assumes a suitable mutual incoherence condition. Lounici [Lou08] proves correct model selection
under the assumption $\max_{i \neq j} |\hat\Sigma_{ij}| = O(1/s_0)$. This assumption is however stronger than
irrepresentability [vdGB09]. Candès and Plan [CP09] also assume mutual incoherence, albeit with a much
weaker requirement, namely $\max_{i \neq j} |\hat\Sigma_{ij}| = O(1/(\log p))$. Under this condition, they establish model
selection guarantees for an ideal scaling of the non-zero coefficients $\theta_{\min} \ge c\sigma\sqrt{(\log p)/n}$. However,
this result only holds with high probability for a 'random signal model' in which the non-zero
coefficients $\theta_{0,i}$ have uniformly random signs.

Finally, model selection consistency can be obtained without irrepresentability through other
methods. For instance [Zou06] develops the adaptive Lasso, using a data-dependent weighted $\ell_1$
regularization, and [Bac08] proposes the Bolasso, a resampling-based technique. Unfortunately,
both of these approaches are only guaranteed to succeed in the low-dimensional regime of $p$ fixed,
and $n \to \infty$.

1.3 Notations 

We provide a brief summary of the notations used throughout the paper. For a matrix $A$ and sets of
indices $I$, $J$, we let $A_J$ denote the submatrix containing just the columns in $J$ and $A_{I,J}$ denote the
submatrix formed by the rows in $I$ and columns in $J$. Likewise, for a vector $v$, $v_I$ is the restriction of
$v$ to indices in $I$. Further, the notation $A_{I,J}^{-1}$ represents the inverse of $A_{I,J}$, i.e., $A_{I,J}^{-1} = (A_{I,J})^{-1}$. The
maximum and the minimum singular values of $A$ are respectively denoted by $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$.
We write $\|v\|_p$ for the standard $\ell_p$ norm of a vector $v$. Specifically, $\|v\|_0$ denotes the number of
nonzero entries in $v$. Also, $\|A\|_p$ refers to the induced operator norm on a matrix $A$. We use $e_i$ to
refer to the $i$-th standard basis element, e.g., $e_1 = (1, 0, \dots, 0)$. For a vector $v$, $\mathrm{supp}(v)$ represents
the positions of nonzero entries of $v$. Throughout, we denote the rows of the design matrix $X$ by
$X_1, \dots, X_n \in \mathbb{R}^p$ and denote its columns by $x_1, \dots, x_p \in \mathbb{R}^n$. Further, for a vector $v$, $\mathrm{sign}(v)$ is the
vector with entries $\mathrm{sign}(v)_i = +1$ if $v_i > 0$, $\mathrm{sign}(v)_i = -1$ if $v_i < 0$, and $\mathrm{sign}(v)_i = 0$ otherwise.

2 Deterministic designs 

An outline of this section is given below: 

1. We first consider the zero-noise problem $W = 0$, and prove several useful properties of the Lasso
estimator in this case. In particular, we show that there exists a threshold for the regularization
parameter below which the support of the Lasso estimator remains the same and contains
$\mathrm{supp}(\theta_0)$. Moreover, the Lasso estimator support is not much larger than $\mathrm{supp}(\theta_0)$.

2. We then turn to the noisy problem, and introduce the generalized irrepresentability condition
(GIC) that is motivated by the properties of the Lasso in the zero-noise case. We prove that
under GIC (and other technical conditions), with high probability, the signed support of the
Lasso estimator is the same as that in the zero-noise problem.

3. We show that the Gauss-Lasso selector correctly recovers the signed support of $\theta_0$.



2.1 Zero-noise problem 

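Recall that $\hat\Sigma \equiv (X^T X/n)$ denotes the empirical covariance of the rows of the design matrix. Given
$\hat\Sigma \in \mathbb{R}^{p \times p}$, $\hat\Sigma \succeq 0$, $\theta_0 \in \mathbb{R}^p$ and $\xi \in \mathbb{R}_+$, we define the zero-noise Lasso estimator as

$$\hat\theta^{ZN}(\xi) \equiv \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2} \langle (\theta - \theta_0), \hat\Sigma (\theta - \theta_0) \rangle + \xi \|\theta\|_1 \Big\}. \tag{5}$$

Note that $\hat\theta^{ZN}(\xi)$ is obtained by letting $Y = X\theta_0$ in the definition of $\hat\theta^n(Y, X; \xi)$.

Following [BRT09], we introduce a restricted eigenvalue constant for the empirical covariance
matrix $\hat\Sigma$:

$$\hat\kappa(s, c_0) \equiv \min_{\substack{J \subseteq [p] \\ |J| \le s}} \ \min_{\substack{u \in \mathbb{R}^p \\ \|u_{J^c}\|_1 \le c_0 \|u_J\|_1}} \frac{\langle u, \hat\Sigma u \rangle}{\|u\|_2^2}. \tag{6}$$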

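Our first result states that the support of $\hat\theta^{ZN}(\xi)$ is not much larger than the support of $\theta_0$, for
any $\xi > 0$.

Lemma 2.1. Let $\hat\theta^{ZN} = \hat\theta^{ZN}(\xi)$ be defined as per Eq. (5), with $\xi > 0$. Then, if $s_0 = \|\theta_0\|_0$,

$$\|\hat\theta^{ZN}\|_0 \le \Big(1 + \frac{4 \|\hat\Sigma\|_2}{\hat\kappa(s_0, 1)}\Big) s_0. \tag{7}$$

The proof of this lemma is deferred to Section A.1.

Lemma 2.2. Let $\hat\theta^{ZN} = \hat\theta^{ZN}(\xi)$ be defined as per Eq. (5), with $\xi > 0$. Then there exist $\xi_0 =
\xi_0(\hat\Sigma, S, \theta_0) > 0$, $T_0 \subseteq [p]$, $v_0 \in \{-1, 0, +1\}^p$, such that the following happens. For all $\xi \in (0, \xi_0)$,
$\mathrm{sign}(\hat\theta^{ZN}(\xi)) = v_0$ and $\mathrm{supp}(\hat\theta^{ZN}(\xi)) = \mathrm{supp}(v_0) = T_0$. Further $T_0 \supseteq S$, $v_{0,S} = \mathrm{sign}(\theta_{0,S})$ and
$\xi_0 = \min_{i \in S} |\theta_{0,i} / [\hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i|$.

The proof of Lemma 2.2 can be found in Section A.2.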

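Finally we have the following standard characterization of the solution of the zero-noise problem.

Lemma 2.3. Let $\hat\theta^{ZN} = \hat\theta^{ZN}(\xi)$ be defined as per Eq. (5), with $\xi > 0$. Let $T \supseteq S$ and $v \in \{+1, 0, -1\}^p$
be such that $\mathrm{supp}(v) = T$. Then $\mathrm{sign}(\hat\theta^{ZN}) = v$ if and only if

$$\|\hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} v_T\|_\infty \le 1, \tag{8}$$

$$v_T = \mathrm{sign}\big(\theta_{0,T} - \xi \hat\Sigma_{T,T}^{-1} v_T\big). \tag{9}$$

Further, if the above holds, $\hat\theta^{ZN}$ is given by $\hat\theta^{ZN}_{T^c} = 0$ and

$$\hat\theta^{ZN}_T = \theta_{0,T} - \xi \hat\Sigma_{T,T}^{-1} v_T.$$

Lemma 2.3 is proved in Appendix A.3.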

Motivated by this result, we introduce the generalized irrepresentability condition (GIC) for
deterministic designs.

Generalized irrepresentability (deterministic designs). The pair $(\hat\Sigma, \theta_0)$, $\hat\Sigma \in \mathbb{R}^{p \times p}$,
$\theta_0 \in \mathbb{R}^p$ satisfy the generalized irrepresentability condition with parameter $\eta > 0$ if the following
happens. Let $v_0$, $T_0$ be defined as per Lemma 2.2. Then

$$\|\hat\Sigma_{T_0^c,T_0} \hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}\|_\infty \le 1 - \eta. \tag{10}$$

In other words we require the dual feasibility condition (8) -which always holds- to hold with a
positive slack $\eta$.
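
For a given covariance and parameter vector, the slack in (10) can be estimated numerically by solving the zero-noise problem (5) at a small value of $\xi$ and reading off the signed support. The following is a minimal sketch under stated assumptions: the solver is plain coordinate descent on the quadratic objective (our choice, not the paper's), the names `zero_noise_lasso` and `gic_slack` are hypothetical, and the caller is responsible for choosing $\xi$ below the threshold $\xi_0$ of Lemma 2.2, which the code does not verify.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def zero_noise_lasso(sigma, theta0, xi, n_sweeps=10000, tol=1e-12):
    """Minimize (1/2)<theta - theta0, sigma (theta - theta0)> + xi ||theta||_1."""
    p = sigma.shape[0]
    theta = np.zeros(p)
    b = sigma @ theta0
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if sigma[j, j] <= 0.0:
                continue
            # Partial residual: b_j minus the contribution of the other coordinates.
            rho = b[j] - sigma[j] @ theta + sigma[j, j] * theta[j]
            new = soft_threshold(rho, xi) / sigma[j, j]
            max_change = max(max_change, abs(new - theta[j]))
            theta[j] = new
        if max_change < tol:
            break
    return theta

def gic_slack(sigma, theta0, xi):
    """Return (slack, T0, v0) with slack = 1 - ||sigma[T0^c,T0] sigma[T0,T0]^{-1} v0_T0||_inf.

    Assumes xi is small enough that sign(theta_ZN(xi)) equals the limit v0.
    """
    v0 = np.sign(zero_noise_lasso(sigma, theta0, xi))
    T0 = np.flatnonzero(v0)
    Tc = np.setdiff1d(np.arange(sigma.shape[0]), T0)
    if T0.size == 0 or Tc.size == 0:
        return 1.0, T0, v0
    w = np.linalg.solve(sigma[np.ix_(T0, T0)], v0[T0])
    value = float(np.max(np.abs(sigma[np.ix_(Tc, T0)] @ w)))
    return 1.0 - value, T0, v0
```

GIC with parameter $\eta$ then corresponds to the returned slack being at least $\eta$; by the dual feasibility condition (8) the slack is never negative, up to numerical error.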

2.2 Noisy problem 

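Consider the noisy linear observation model as described in (2), and let $\hat r \equiv (X^T W/n)$. We begin
with a standard characterization of $\mathrm{sign}(\hat\theta^n)$, the signed support of the Lasso estimator (3).

Lemma 2.4. Let $\hat\theta^n = \hat\theta^n(y, X; \lambda)$ be defined as per Eq. (3), and let $z \in \{+1, 0, -1\}^p$ with $\mathrm{supp}(z) =
T$. Further assume $T \supseteq S$. Then the signed support of the Lasso estimator is given by $\mathrm{sign}(\hat\theta^n) = z$
if and only if

$$\Big\| \hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} z_T + \frac{1}{\lambda} \big( \hat r_{T^c} - \hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} \hat r_T \big) \Big\|_\infty \le 1, \tag{11}$$

$$z_T = \mathrm{sign}\big( \theta_{0,T} - \hat\Sigma_{T,T}^{-1} (\lambda z_T - \hat r_T) \big). \tag{12}$$

Lemma 2.4 is proved in Appendix A.4.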

Theorem 2.5. Consider the deterministic design model with empirical covariance matrix $\hat\Sigma \equiv
(X^T X)/n$, and assume that $\hat\Sigma_{i,i} \le 1$ for $i \in [p]$. Let $T_0 \subseteq [p]$, $v_0 \in \{+1, 0, -1\}^p$ be the set and
vector defined in Lemma 2.2, and $t_0 \equiv |T_0|$. Assume that

(i) We have $\sigma_{\min}(\hat\Sigma_{T_0,T_0}) \ge C_{\min} > 0$.

(ii) The pair $(\hat\Sigma, \theta_0)$ satisfies the generalized irrepresentability condition with parameter $\eta$.

Consider the Lasso estimator $\hat\theta^n = \hat\theta^n(y, X; \lambda)$ defined as per Eq. (3), with regularization parameter

$$\lambda = \frac{\sigma}{\eta} \sqrt{\frac{2 c_1 \log p}{n}}, \tag{13}$$

for some constant $c_1 > 1$, and suppose that

(iii) For some $c_2 > 0$:

$$|\theta_{0,i}| \ge c_2 \lambda + \lambda \big| [\hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i \big| \quad \text{for all } i \in S, \tag{14}$$

$$\big| [\hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i \big| \ge c_2 \quad \text{for all } i \in T_0 \setminus S. \tag{15}$$

We further assume, without loss of generality, $\eta \le c_2 \sqrt{C_{\min}}$. Then the following holds true:

$$P\big\{ \mathrm{sign}(\hat\theta^n(\lambda)) = v_0 \big\} \ge 1 - 4 p^{1 - c_1}. \tag{16}$$

Theorem 2.5 is proved in Section 5.1. Note that, even in the case standard irrepresentability
holds (and hence $T_0 = S$), this result improves over [Wai09, Theorem 1.(b)], in that the required
lower bound for $|\theta_{0,i}|$, $i \in S$, does not depend on $\|\hat\Sigma_{S,S}^{-1}\|_\infty$. More precisely, Theorem 2.5 assumes
$|\theta_{0,i}| \ge \lambda (c_2 + |[\hat\Sigma_{S,S}^{-1} v_{0,S}]_i|)$ for $i \in S$, which is weaker than the assumption of Theorem 1.(b) [Wai09],
namely $|\theta_{0,i}| \ge \lambda (c + \|\hat\Sigma_{S,S}^{-1}\|_\infty)$, since $\|v_{0,S}\|_\infty \le 1$.

Remark 2.6. Condition (i) in Theorem 2.5 requires the submatrix $\hat\Sigma_{T_0,T_0}$ to have minimum singular
value bounded away from zero. Assuming $\hat\Sigma_{S,S}$ to be non-singular is necessary for identifiability.
Requiring the minimum singular value of $\hat\Sigma_{T_0,T_0}$ to be bounded away from zero is not much more
restrictive since $T_0$ is comparable in size with $S$, as stated in Lemma 2.1.

We next show that the Gauss-Lasso selector correctly recovers the support of $\theta_0$.

Theorem 2.7. Consider the deterministic design model with empirical covariance matrix $\hat\Sigma \equiv
(X^T X)/n$, and assume that $\hat\Sigma_{i,i} \le 1$ for $i \in [p]$. Under the assumptions of Theorem 2.5,

$$P\big( \|\hat\theta^{GL} - \theta_0\|_\infty \ge \mu \big) \le 4 p^{1 - c_1} + 2 p\, e^{-n C_{\min} \mu^2 / 2\sigma^2}.$$

In particular, if $\hat S$ is the model selected by the Gauss-Lasso, we have

$$P(\hat S = S) \ge 1 - 6 p^{1 - c_1/4}.$$

The proof of Theorem 2.7 is given in Section 5.2.

3 Random Gaussian designs 

In the previous section, we studied the case of deterministic design models, which allowed for a
straightforward analysis. Here, we consider the random design model, which needs a more involved
analysis. Within the random Gaussian design model, the rows $X_i$ are distributed as $X_i \sim N(0, \Sigma)$
for some (unknown) covariance matrix $\Sigma \succ 0$.

In order to study the performance of the Gauss-Lasso selector in this case, we first define the
population-level estimator. Given $\Sigma \in \mathbb{R}^{p \times p}$, $\Sigma \succ 0$, $\theta_0 \in \mathbb{R}^p$ and $\xi \in \mathbb{R}_+$, the population-level
estimator $\hat\theta^\infty(\xi) = \hat\theta^\infty(\xi; \theta_0, \Sigma)$ is defined as

$$\hat\theta^\infty(\xi) \equiv \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2} \langle (\theta - \theta_0), \Sigma (\theta - \theta_0) \rangle + \xi \|\theta\|_1 \Big\}. \tag{17}$$

Notice that the minimizer is unique because $\Sigma$ is strictly positive definite and hence the cost function
on the right-hand side is strongly convex. In fact, the population-level estimator is obtained by
assuming that the response vector $Y$ is noiseless and $n = \infty$, hence replacing the empirical covariance
$(X^T X/n)$ with the exact covariance $\Sigma$ in the Lasso optimization problem (3).

Notice that the population-level estimator $\hat\theta^\infty$ is deterministic, albeit $X$ is a random design. We
show that under some conditions on the covariance $\Sigma$ and vector $\theta_0$, $T \equiv \mathrm{supp}(\hat\theta^n) = \mathrm{supp}(\hat\theta^\infty)$, i.e.,
the population-level estimator and the Lasso estimator share the same (signed) support. Further $T \supseteq
S$. Since $\hat\theta^\infty$ (and hence $T$) is deterministic, $X_T$ is a Gaussian matrix with rows drawn independently
from $N(0, \Sigma_{T,T})$. This observation allows for a simple analysis of the Gauss-Lasso selector $\hat\theta^{GL}$.
An outline of the section is given below:

1. We begin by proving several properties of the population-level estimator. Similar to the
zero-noise problem in Section 2.1, we show that there exists a threshold $\xi_0$, such that for all
$\xi \in (0, \xi_0)$, $\mathrm{supp}(\hat\theta^\infty(\xi))$ remains the same and contains $\mathrm{supp}(\theta_0)$. Moreover, $\mathrm{supp}(\hat\theta^\infty(\xi))$ is
not much larger than $\mathrm{supp}(\theta_0)$.

2. We show that under GIC for covariance matrix $\Sigma$ (and other sufficient conditions), with high
probability, the signed support of the Lasso estimator is the same as the signed support of the
population-level estimator.

3. Following the previous steps, we show that the Gauss-Lasso selector correctly recovers the
signed support of $\theta_0$.



3.1 The n = oo problem 

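In this section we derive several useful properties of the population-level problem (17). Comparing
Eqs. (5) and (17), the estimators $\hat\theta^{ZN}(\xi)$ and $\hat\theta^\infty(\xi)$ are defined in a very similar manner (the former
is defined with respect to $\hat\Sigma$ and the latter is defined with respect to $\Sigma$), and as we will see $\hat\theta^\infty$ also
possesses the properties stated in Section 2.1.

Let $\kappa_\infty(s, c_0)$ be the restricted eigenvalue constant for the covariance matrix $\Sigma$:

$$\kappa_\infty(s, c_0) \equiv \min_{\substack{J \subseteq [p] \\ |J| \le s}} \ \min_{\substack{u \in \mathbb{R}^p \\ \|u_{J^c}\|_1 \le c_0 \|u_J\|_1}} \frac{\langle u, \Sigma u \rangle}{\|u\|_2^2}. \tag{18}$$

The proofs of the following lemmas are very similar to the corresponding ones in Section 2.1,
and are omitted.

Lemma 3.1. Let $\hat\theta^\infty = \hat\theta^\infty(\xi)$ be defined as per Eq. (17), with $\xi > 0$. Then, if $s_0 = \|\theta_0\|_0$,

$$\|\hat\theta^\infty\|_0 \le \Big(1 + \frac{4 \|\Sigma\|_2}{\kappa_\infty(s_0, 1)}\Big) s_0. \tag{19}$$

Lemma 3.2. Let $\hat\theta^\infty = \hat\theta^\infty(\xi)$ be defined as per Eq. (17), with $\xi > 0$. Then there exist $\xi_0 =
\xi_0(\Sigma, S, \theta_0) > 0$, $T_0 \subseteq [p]$, $v_0 \in \{-1, 0, +1\}^p$, such that the following happens. For all $\xi \in (0, \xi_0)$,
$\mathrm{sign}(\hat\theta^\infty(\xi)) = v_0$ and $\mathrm{supp}(\hat\theta^\infty(\xi)) = \mathrm{supp}(v_0) = T_0$. Further $T_0 \supseteq S$, $v_{0,S} = \mathrm{sign}(\theta_{0,S})$ and
$\xi_0 = \min_{i \in S} |\theta_{0,i} / [\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i|$.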

Finally we have the following standard characterization of the solution of the $n = \infty$ problem
(17).

Lemma 3.3. Let $\hat\theta^\infty = \hat\theta^\infty(\xi)$ be defined as per Eq. (17), with $\xi > 0$. Let $T \supseteq S$ and $v \in \{+1, 0, -1\}^p$
be such that $\mathrm{supp}(v) = T$. Then $\mathrm{sign}(\hat\theta^\infty) = v$ if and only if

$$\|\Sigma_{T^c,T} \Sigma_{T,T}^{-1} v_T\|_\infty \le 1,$$

$$v_T = \mathrm{sign}\big(\theta_{0,T} - \xi \Sigma_{T,T}^{-1} v_T\big).$$

Further, if the above holds, $\hat\theta^\infty$ is given by $\hat\theta^\infty_{T^c} = 0$ and

$$\hat\theta^\infty_T = \theta_{0,T} - \xi \Sigma_{T,T}^{-1} v_T.$$

Motivated by this result, we introduce the following assumption.

Generalized irrepresentability (random designs). The pair $(\Sigma, \theta_0)$, $\Sigma \in \mathbb{R}^{p \times p}$, $\theta_0 \in
\mathbb{R}^p$ satisfy the generalized irrepresentability condition with parameter $\eta > 0$ if the following
happens. Let $v_0$, $T_0$ be defined as per Lemma 3.2. Then

$$\|\Sigma_{T_0^c,T_0} \Sigma_{T_0,T_0}^{-1} v_{0,T_0}\|_\infty \le 1 - \eta. \tag{20}$$

3.2 The high-dimensional problem 

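We now consider the Lasso estimator (3). Recall the notations

$$\hat\Sigma \equiv \frac{1}{n} X^T X, \qquad \hat r \equiv \frac{1}{n} X^T W.$$

Note that $\hat\Sigma \in \mathbb{R}^{p \times p}$, $\hat r \in \mathbb{R}^p$ are both random quantities in the case of random designs.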

Theorem 3.4. Consider the Gaussian random design model with covariance matrix $\Sigma \succ 0$, and
assume that $\Sigma_{i,i} \le 1$ for $i \in [p]$. Let $T_0 \subseteq [p]$, $v_0 \in \{+1, 0, -1\}^p$ be the deterministic set and vector
defined in Lemma 3.2, and $t_0 \equiv |T_0|$. Assume that

(i) We have $\sigma_{\min}(\Sigma_{T_0,T_0}) \ge C_{\min} > 0$.

(ii) The pair $(\Sigma, \theta_0)$ satisfies the generalized irrepresentability condition with parameter $\eta$.

Consider the Lasso estimator $\hat\theta^n = \hat\theta^n(y, X; \lambda)$ defined as per Eq. (3), with regularization parameter

$$\lambda = \frac{4\sigma}{\eta} \sqrt{\frac{c_1 \log p}{n}}, \tag{21}$$

for some constant $c_1 > 1$, and suppose that

(iii) For some $c_2 > 0$:

$$|\theta_{0,i}| \ge c_2 \lambda + \frac{3}{2} \lambda \big| [\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i \big| \quad \text{for all } i \in S, \tag{22}$$

$$\big| [\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i \big| \ge 2 c_2 \quad \text{for all } i \in T_0 \setminus S. \tag{23}$$

We further assume, without loss of generality, $\eta \le c_2 \sqrt{C_{\min}}$.



If $n \ge \max(M_1, M_2)\, t_0 \log p$ with



Mi = ' , M 3 - 



7 0111111 c 2 u min 

then the following holds true:

$$P\big\{ \mathrm{sign}(\hat\theta^n(\lambda)) = v_0 \big\} \ge 1 - p\, e^{-n/10} - 6 e^{-t_0/2} - 8 p^{1 - c_1}. \tag{24}$$

Under standard irrepresentability, this result improves over [Wai09, Theorem 3.(ii)], in that the
required lower bound for $|\theta_{0,i}|$, $i \in S$, does not depend on $\|\Sigma_{S,S}^{-1/2}\|_\infty$. More precisely, Theorem 3.4
assumes $|\theta_{0,i}| \ge \lambda (c_2 + 1.5\, |[\Sigma_{S,S}^{-1} v_{0,S}]_i|)$, for $i \in S$, while Theorem 3.(ii) [Wai09] requires
$|\theta_{0,i}| \ge c \lambda \|\Sigma_{S,S}^{-1/2}\|_\infty^2$, for $i \in S$. Note that $|[\Sigma_{S,S}^{-1} v_{0,S}]_i| \le \|\Sigma_{S,S}^{-1}\|_\infty \le \|\Sigma_{S,S}^{-1/2}\|_\infty^2$.

While being closely analogous to Theorem 2.5, the last theorem has somewhat worse constants.
Indeed in the present case we need to control the randomness of the design matrix $X$ in addition to
that of the noise.

Remark 3.5. Condition (i) follows readily from the restricted eigenvalue constraint as in Eq. (18),
i.e., $\kappa_\infty(t_0, 0) > 0$. This is a reasonable assumption since $T_0$ is not much larger than $S$, as stated
in Lemma 3.1.

Corollary 3.6. Under the assumptions of Theorem 3.4, if $n \ge \max(\tilde M_1, \tilde M_2)\, s_0 \log p$, with

$$\tilde M_1 \equiv M_1 \Big(1 + \frac{4 \|\Sigma\|_2}{\kappa_\infty(s_0, 1)}\Big), \qquad \tilde M_2 \equiv M_2 \Big(1 + \frac{4 \|\Sigma\|_2}{\kappa_\infty(s_0, 1)}\Big),$$

then the following holds:

$$P\big\{ \mathrm{sign}(\hat\theta^n(\lambda)) = v_0 \big\} \ge 1 - p\, e^{-n/10} - 6 e^{-s_0/2} - 8 p^{1 - c_1}.$$

Proof (Corollary 3.6). The result follows readily from Theorem 3.4, noting that $s_0 \le t_0$ since $S \subseteq
T_0$, and $t_0 \le (1 + 4 \|\Sigma\|_2 / \kappa_\infty(s_0, 1))\, s_0$ as per Lemma 3.1. □

Below, we show that the Gauss-Lasso selector correctly recovers the signed support of $\theta_0$.

Theorem 3.7. Consider the random Gaussian design model with covariance matrix $\Sigma \succ 0$, and assume
that $\Sigma_{i,i} \le 1$ for $i \in [p]$. Under the assumptions of Theorem 3.4, and for $n \ge \max(\tilde M_1, \tilde M_2)\, s_0 \log p$,
we have

$$P\big( \|\hat\theta^{GL} - \theta_0\|_\infty \ge \mu \big) \le p\, e^{-n/10} + 6 e^{-s_0/2} + 8 p^{1 - c_1} + 2 p\, e^{-n C_{\min} \mu^2 / 2\sigma^2}.$$

Moreover, letting $\hat S$ be the model returned by the Gauss-Lasso selector, we have

$$P(\hat S = S) \ge 1 - p\, e^{-n/10} - 6 e^{-s_0/2} - 10 p^{1 - c_1}.$$

The proof of Theorem 3.7 is deferred to Section 6.4.

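Remark 3.8. [Detection level] Let $\theta_{\min} \equiv \min_{i\in S}|\theta_{0,i}|$ be the minimum magnitude of the non-zero entries of vector $\theta_0$. By Theorem 3.7, the Gauss-Lasso selector correctly recovers $\mathrm{supp}(\theta_0)$, with probability greater than $1 - p\,e^{-n/10} - 6\,e^{-s_0/2} - 10\,p^{1-c_1}$, if $n \ge \max(\widetilde M_1, \widetilde M_2)\, s_0\log p$, and

$$\theta_{\min} \ge C\sigma\sqrt{\frac{\log p}{n}}\,\big(1 + \|\Sigma^{-1}_{T_0,T_0}\|_\infty\big), \tag{25}$$

where $C = C(c_1, c_2, \eta)$ is a constant depending on $c_1$, $c_2$, and $\eta$. Eq. (25) stems from the condition (iii) in Theorem 3.4.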

We can further generalize this result. Define

$$S_1 = \Big\{ i \in S :\ |\theta_{0,i}| > C\sigma\sqrt{\frac{\log p}{n}}\,\big(1 + \|\Sigma^{-1}_{T_0,T_0}\|_\infty\big) \Big\},$$

and $S_2 = S \setminus S_1$. By a very similar argument to the proof of Theorem 3.4, the Gauss-Lasso selector can recover $S_1$, if $\|\theta_{0,S_2}\|_2 = O(\sigma\sqrt{\log p/n})$. More precisely, letting $\widetilde W = X\theta_{0,S_2} + W$, the response vector $Y$ can be recast as $Y = X\theta_{0,S_1} + \widetilde W$ and the Gauss-Lasso selector treats the small entries $\theta_{0,S_2}$ as noise.

Figure 1: Parameter vector $\theta_0$ for the communities dataset. The entries with magnitude larger than 0.04 (shown in black) are treated as significant ones.

4 UCI communities and crimes data example 

We consider the problem of predicting the rate of violent crimes in different communities within the US, based on other demographic attributes of the communities. We evaluate the performance of the Gauss-Lasso selector on the UCI communities and crimes dataset [FA10]. The dataset consists of a univariate response variable and 122 predictive attributes for 1994 communities. The response variable is the total number of violent crimes per 100K population. Covariates are quantitative, including, e.g., the average family income, the fraction of unemployed population, and the police operating budget. We consider a linear model as in (2) and perform model selection using the Gauss-Lasso selector and the Lasso estimator.

We do the following preprocessing steps: (i) Each missing value is replaced by the mean of the non-missing values of that attribute for other communities; (ii) We eliminate 16 attributes to make the ensemble of the attribute vectors linearly independent; (iii) We normalize the columns to have mean zero and $\ell_2$ norm $\sqrt{n}$. Thus we obtain a design matrix $X_{\rm tot} \in \mathbb{R}^{n_{\rm tot}\times p}$ with $n_{\rm tot} = 1994$ and $p = 106$.
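Steps (i) and (iii) of this preprocessing can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: missing values are assumed to be encoded as NaN, no column is constant after imputation, and the function name `preprocess` is hypothetical. Step (ii), the removal of linearly dependent attributes, is not shown.

```python
import numpy as np

def preprocess(X_raw):
    """Mean-impute missing entries (NaN), then center every column and
    rescale it so that its l2 norm equals sqrt(n)."""
    X = np.array(X_raw, dtype=float)
    n = X.shape[0]
    # Step (i): replace each missing entry by the mean of the observed
    # entries of the same column.
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    # Step (iii): center each column, then rescale to norm sqrt(n).
    # Assumes no column is constant (otherwise the norm is zero).
    X = X - X.mean(axis=0)
    norms = np.linalg.norm(X, axis=0)
    return X * (np.sqrt(n) / norms)
```

With this normalization the diagonal entries of $X^{\sf T}X/n$ are all equal to one, which matches the assumption on the diagonal of the empirical covariance used in the theorems.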

For the sake of performance evaluation, we need to know the true model, i.e., the true significant covariates. We let $\theta_0 = (X_{\rm tot}^{\sf T} X_{\rm tot})^{-1} X_{\rm tot}^{\sf T}\, y$ be the least squares solution obtained from the whole dataset $X_{\rm tot}$. The entries of $\theta_0$ are shown in Fig. 1. Clearly only a few of them are non-negligible, corresponding to the true model. We treat the entries with magnitude larger than 0.04 as truly active and the others as truly inactive. The number of active covariates according to this criterion is $s_0 = 13$.

We choose random subsamples of size $n = 85$ from the communities and normalize each column of the resulting design matrix to have mean zero and $\ell_2$ norm $\sqrt{n}$. We use the Gauss-Lasso selector and the Lasso for model selection based on this design. Figures 2 and 3 respectively show the solution path for Gauss-Lasso and Lasso as the parameter $\lambda$ changes from $\lambda = 0.001$ to $\lambda = 1$. The paths corresponding to the truly active set are in black and the paths corresponding to the truly inactive variables are in red. At $\lambda = 1$, the solutions $\widehat\theta^{\rm GL}(\lambda)$ and $\widehat\theta^n(\lambda)$ have no active variables; for decreasing $\lambda$, each knot $\lambda_k$ marks the entry or removal of some variables from the current active set of the Lasso solution. Therefore, the support $T$ of the Lasso solution remains constant in between knots. Since the Gauss-Lasso selector performs ordinary least squares restricted to $T$, its coordinate paths are constant in between knots. However, the Lasso paths are linear with respect to $\lambda$, with changes in slope at the knots (see e.g., [EHJT04] for a discussion).
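The two-stage procedure described here (Lasso, then ordinary least squares restricted to the Lasso support $T$, then selection of the $s_0$ largest entries) can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the Lasso is solved by plain coordinate descent on the objective $(1/2n)\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$, and the function names, the iteration count and the support tolerance `tol` are all assumptions made for the example.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2n)||y - X theta||^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n       # ||x_j||^2 / n for each column
    resid = np.asarray(y, dtype=float).copy()  # residual y - X theta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * theta[j]
            # Soft-thresholding update for coordinate j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (theta[j] - new)
            theta[j] = new
    return theta

def gauss_lasso(X, y, lam, s0, tol=1e-10):
    """Lasso, then least squares on the Lasso support T,
    then keep the s0 entries of largest magnitude."""
    theta_lasso = lasso_cd(X, y, lam)
    T = np.flatnonzero(np.abs(theta_lasso) > tol)
    theta_gl = np.zeros(X.shape[1])
    if T.size > 0:
        theta_gl[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    S_hat = np.sort(np.argsort(-np.abs(theta_gl))[:s0])
    return theta_gl, T, S_hat
```

Because the second stage depends on $\lambda$ only through $T$, evaluating `gauss_lasso` on a grid of $\lambda$ values yields piecewise constant coordinate paths, in line with the discussion above.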

It is clear from Figure 3 that the Lasso support either misses a large fraction of the truly active covariates, or includes many false positives. For instance at $\lambda = 0.08$, we get 4 true positives out of 13 and 4 false positives. On the other hand, for a smaller value of the regularization parameter, $\lambda = 0.01$, we get 10 true positives out of 13 and 8 false positives.¹

If we consider on the other hand the Gauss-Lasso, any $\lambda < 0.02$ produces a set of coefficients with a gap between large ones, which are mostly true positives, and small ones, which are mostly true negatives.



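¹ We treat the entries of the Lasso solution with magnitude less than 0.005 as zero.

5 Proof of Theorems 2.5 and 2.7

In this section we prove Theorems 2.5 and 2.7 using Lemmas 2.1 to 2.4. The latter are proved in the appendices.

5.1 Proof of Theorem 2.5

By the condition (iii) in the statement of the theorem, we have

$$\lambda < \min_{i\in S}\left|\frac{\theta_{0,i}}{[\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}]_i}\right| = \xi_0,$$

where the equality holds because of Lemma 2.2. By Lemma 2.2, we know that $\mathrm{sign}(\widehat\theta^{\rm ZN}(\lambda)) = v_0$ and that $\mathrm{supp}(v_0) = T_0$ contains the true support $S$. Applying Lemma 2.3, Eq. (9) and using the generalized irrepresentability assumption (10), we obtain

$$\big\|\widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}\big\|_\infty \le 1 - \eta, \tag{26}$$

$$v_{0,T_0} = \mathrm{sign}\big(\theta_{0,T_0} - \lambda\,\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}\big). \tag{27}$$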
Figure 2: Coordinate paths for the Gauss-Lasso selector and a random subset of $n = 85$ communities. The paths corresponding to the significant variables of $\theta_0$ are shown in black. The coordinate paths for Gauss-Lasso are piecewise constant.

Figure 3: Coordinate paths for the Lasso selector and a random subset of $n = 85$ communities. The paths corresponding to the significant variables of $\theta_0$ are shown in black. The coordinate paths for Lasso are piecewise linear.



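Also, by Lemma 2.4, $\mathrm{sign}(\widehat\theta^n) = v_0$ if Eqs. (11) and (12) hold with $z = v_0$ and $T = T_0$, namely, if

$$\Big\|\widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0} + \frac{1}{\lambda}\big(\widehat r_{T_0^c} - \widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big)\Big\|_\infty \le 1, \tag{28}$$

$$v_{0,T_0} = \mathrm{sign}\Big(\theta_{0,T_0} - \widehat\Sigma^{-1}_{T_0,T_0}\big(\lambda v_{0,T_0} - \widehat r_{T_0}\big)\Big). \tag{29}$$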



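In the sequel, we show that these equations are satisfied, with probability lower bounded as per Eq. (16).

We begin with proving Eq. (28). Let $\mathcal{T} \equiv (1/\lambda)\big(\widehat r_{T_0^c} - \widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big)$. By Eq. (26), it suffices to show that $\|\mathcal{T}\|_\infty \le \eta$. Plugging for $\widehat r$, we get $\mathcal{T} = X_{T_0^c}^{\sf T}\,\Pi_{X_{T_0}^\perp} W/(n\lambda)$, where $\Pi_{X_{T_0}^\perp} = {\rm I} - X_{T_0}(X_{T_0}^{\sf T}X_{T_0})^{-1}X_{T_0}^{\sf T}$ is the orthogonal projection onto the orthogonal complement of the column space of $X_{T_0}$. Since $W \sim {\sf N}(0, \sigma^2 {\rm I}_{n\times n})$, the variable $\mathcal{T}_j = x_j^{\sf T}\,\Pi_{X_{T_0}^\perp} W/(n\lambda)$ is normal with variance at most

$$\Big(\frac{\sigma}{n\lambda}\Big)^2 \big\|\Pi_{X_{T_0}^\perp} x_j\big\|_2^2 \le \Big(\frac{\sigma}{n\lambda}\Big)^2 \|x_j\|_2^2 \le \frac{\sigma^2}{n\lambda^2},$$

where we used the fact that $\|x_j\|_2^2 \le n$, as $\widehat\Sigma_{jj} \le 1$. By the Gaussian tail bound with union bound over $j \in T_0^c$, we obtain

$$\mathbb{P}\big(\|\mathcal{T}\|_\infty \le \eta\big) \ge 1 - 2p\,e^{-n\lambda^2\eta^2/(2\sigma^2)} = 1 - 2p^{1-c_1}. \tag{30}$$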
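The arithmetic behind Eq. (30) can be checked numerically. The sketch below is only a sanity check under an explicit assumption: it takes $\lambda = (\sigma/\eta)\sqrt{2c_1\log p/n}$, which is the choice of regularization for which the two expressions in Eq. (30) coincide. The function names are hypothetical.

```python
import math

def union_tail_bound(p, n, lam, eta, sigma):
    """Gaussian tail bound plus union bound over at most p coordinates:
    2 p exp(-n lam^2 eta^2 / (2 sigma^2))."""
    return 2.0 * p * math.exp(-n * lam ** 2 * eta ** 2 / (2.0 * sigma ** 2))

def lam_for(c1, p, n, eta, sigma):
    """Assumed regularization level (sigma/eta) * sqrt(2 c1 log p / n),
    for which the bound above equals 2 p^(1 - c1)."""
    return (sigma / eta) * math.sqrt(2.0 * c1 * math.log(p) / n)
```

For instance, with $p = 1000$, $n = 200$, $\eta = 0.5$, $\sigma = 1$ and $c_1 = 2$, `union_tail_bound(1000, 200, lam_for(2, 1000, 200, 0.5, 1.0), 0.5, 1.0)` agrees with $2p^{1-c_1} = 0.002$ up to floating-point error.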
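We next prove Eq. (29). Given Eq. (27), we need to show

$$\mathrm{sign}\big(\theta_{0,T_0} - \lambda\,\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}\big) = \mathrm{sign}\Big(\theta_{0,T_0} - \widehat\Sigma^{-1}_{T_0,T_0}\big(\lambda v_{0,T_0} - \widehat r_{T_0}\big)\Big).$$

Let $u = \theta_{0,T_0} - \lambda\,\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}$, and $\widehat u = \theta_{0,T_0} - \widehat\Sigma^{-1}_{T_0,T_0}(\lambda v_{0,T_0} - \widehat r_{T_0})$.

By condition (iii), we have, for all $i \in S$, $|u_i| \ge |\theta_{0,i}| - \lambda\,|[\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}]_i| \ge c_2\lambda$. Further, for all $i \in T_0 \setminus S$, we have $|u_i| = \lambda\,|[\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}]_i| \ge c_2\lambda$. Summarizing, for all $i \in T_0$, we have $|u_i| \ge c_2\lambda$. We will show that $\|u - \widehat u\|_\infty = \|\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\|_\infty < c_2\lambda$, with high probability, thus implying $\mathrm{sign}(\widehat u_{T_0}) = \mathrm{sign}(u_{T_0})$ as desired.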

Lemma 5.1. The following holds true.

$$\mathbb{P}\Big(\big\|\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big\|_\infty \ge \sigma\sqrt{\frac{2c_1\log p}{n}}\;\big\|\widehat\Sigma^{-1}_{T_0,T_0}\big\|_2^{1/2}\Big) \le 2p^{1-c_1}. \tag{31}$$

Lemma 5.1 is proved by noting that, conditioned on $X_{T_0}$, $\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}$ is a Gaussian vector and then applying a standard tail bound inequality. The details are deferred to Section A.5.

Using Lemma 5.1 and the assumption $\eta \le c_2\sqrt{C_{\min}}$, we get $\|u - \widehat u\|_\infty < c_2\lambda$, with probability at least $1 - 2p^{1-c_1}$.

Putting all this together, Eqs. (28) and (29) hold simultaneously, with probability at least $1 - 4p^{1-c_1}$. This implies the thesis.

5.2 Proof of Theorem 2.7 

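Recall that $T = \mathrm{supp}(\widehat\theta^n)$. On the event $\mathcal{E} \equiv \{T = T_0\}$, we have

$$\widehat\theta^{\rm GL}_T = (X_T^{\sf T}X_T)^{-1}X_T^{\sf T}(X_T\theta_{0,T} + W) = \theta_{0,T} + (X_T^{\sf T}X_T)^{-1}X_T^{\sf T}W,$$

where the first equality holds since $T = T_0 \supseteq S$ and thus $\theta_{0,T^c} = 0$. Further note that $\widehat\theta^{\rm GL}_i - \theta_{0,i}$, for $i \in T$, is a zero mean Gaussian random variable with variance

$$\sigma^2\big\|e_i^{\sf T}(X_T^{\sf T}X_T)^{-1}X_T^{\sf T}\big\|_2^2 \le \sigma^2\,\|\widehat\Sigma^{-1}_{T,T}\|_2/n \le \sigma^2/(nC_{\min}).$$

Using a Gaussian tail bound along with union bounding over $i \in [p]$, we get

$$\mathbb{P}\big(\|\widehat\theta^{\rm GL}_T - \theta_{0,T}\|_\infty \ge \mu;\ \mathcal{E}\big) \le 2p\,e^{-nC_{\min}\mu^2/2\sigma^2}.$$

Also, under the assumptions of Theorem 2.5, $\mathbb{P}(\mathcal{E}) \ge 1 - 4p^{1-c_1}$. Hence

$$\mathbb{P}\big(\|\widehat\theta^{\rm GL}_T - \theta_{0,T}\|_\infty \ge \mu\big) \le \mathbb{P}\big(\|\widehat\theta^{\rm GL}_T - \theta_{0,T}\|_\infty \ge \mu;\ \mathcal{E}\big) + \mathbb{P}(\mathcal{E}^c) \le 2p\,e^{-nC_{\min}\mu^2/2\sigma^2} + 4p^{1-c_1}.$$

Since $\widehat\theta^{\rm GL}_{T^c} = \theta_{0,T^c} = 0$, we get $\|\widehat\theta^{\rm GL} - \theta_0\|_\infty < \mu$, with probability at least $1 - 4p^{1-c_1} - 2p\,e^{-nC_{\min}\mu^2/2\sigma^2}$.

Moreover, if $\|\widehat\theta^{\rm GL} - \theta_0\|_\infty < \theta_{\min}/2$, then $|\widehat\theta^{\rm GL}_i| > \theta_{\min}/2$ for $i \in S$ and $|\widehat\theta^{\rm GL}_i| < \theta_{\min}/2$ for $i \in S^c$. Hence, the $s_0$ top entries of $\widehat\theta^{\rm GL}$ (in modulus), returned by the Gauss-Lasso selector, correspond to the true support $S$. Therefore,

$$\mathbb{P}(\widehat S = S) \ge \mathbb{P}\big(\|\widehat\theta^{\rm GL} - \theta_0\|_\infty < \theta_{\min}/2\big) \ge 1 - 4p^{1-c_1} - 2p\,e^{-nC_{\min}\theta_{\min}^2/8\sigma^2} \ge 1 - 6p^{1-c_1/4},$$

where the last inequality follows from the facts $\theta_{\min} \ge c_2\lambda$, and $\eta \le c_2\sqrt{C_{\min}}$.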

6 Proof of Theorems 3.4 and 3.7 

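By the condition (iii) in the statement of the theorem, we have

$$\lambda \le \frac{2}{3}\,\min_{i\in S}\left|\frac{\theta_{0,i}}{[\Sigma^{-1}_{T_0,T_0} v_{0,T_0}]_i}\right| \le \xi_0,$$

where the second inequality holds because of Lemma 3.2. Therefore, as a result of Lemma 3.2, we have $\mathrm{sign}(\widehat\theta^{\infty}(\lambda)) = v_0$ and that $\mathrm{supp}(v_0) = T_0$ contains the true support $S$. Applying Lemma 3.3 and using the generalized irrepresentability assumption, we have

$$\big\|\Sigma_{T_0^c,T_0}\Sigma^{-1}_{T_0,T_0} v_{0,T_0}\big\|_\infty \le 1 - \eta, \tag{32}$$

$$v_{0,T_0} = \mathrm{sign}\big(\theta_{0,T_0} - \lambda\,\Sigma^{-1}_{T_0,T_0} v_{0,T_0}\big). \tag{33}$$

Moreover, by Lemma 2.4, $\mathrm{sign}(\widehat\theta^n) = v_0$ if Eqs. (11) and (12) hold with $z = v_0$ and $T = T_0$, namely

$$\Big\|\widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0} + \frac{1}{\lambda}\big(\widehat r_{T_0^c} - \widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big)\Big\|_\infty \le 1, \tag{34}$$

$$v_{0,T_0} = \mathrm{sign}\Big(\theta_{0,T_0} - \widehat\Sigma^{-1}_{T_0,T_0}\big(\lambda v_{0,T_0} - \widehat r_{T_0}\big)\Big). \tag{35}$$

The rest of the proof is devoted to showing the validity of these equations, with probability lower bounded as per Eq. (24).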
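6.1 Proof of Eq. (34)

It is immediate to see that Eq. (34) holds if the following hold true:

$$\mathcal{T}_1 \equiv \big\|\widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0} v_{0,T_0}\big\|_\infty \le 1 - \frac{\eta}{2}, \tag{36}$$

$$\mathcal{T}_2 \equiv \frac{1}{\lambda}\big\|\widehat r_{T_0^c} - \widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big\|_\infty \le \frac{\eta}{2}. \tag{37}$$

In order to prove inequalities (36) and (37), it is useful to recall the following proposition from random matrix theory.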

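Proposition 6.1 ([DS01, Wai09, Ver12]). For $k \le n$, let $X \in \mathbb{R}^{n\times k}$ be a random matrix with i.i.d. rows drawn from ${\sf N}(0, \Sigma)$. Then the following hold true for all $t \ge 0$ and $\tau \equiv 2\big(\sqrt{k/n} + t\big) + \big(\sqrt{k/n} + t\big)^2$.

(a) If $\Sigma$ has maximum eigenvalue $\sigma_{\max} < \infty$, then

$$\mathbb{P}\Big(\Big\|\frac{1}{n}X^{\sf T}X - \Sigma\Big\|_2 \ge \sigma_{\max}\tau\Big) \le 2e^{-nt^2/2}.$$

(b) If $\Sigma$ has minimum eigenvalue $\sigma_{\min} > 0$, then

$$\mathbb{P}\Big(\Big\|\Big(\frac{1}{n}X^{\sf T}X\Big)^{-1} - \Sigma^{-1}\Big\|_2 \ge \sigma_{\min}^{-1}\tau\Big) \le 2e^{-nt^2/2}.$$

We consider the particular choice $t = \sqrt{k/n}$, which is useful for future reference. Since $k/n \le 1$, we get $\tau \le 8\sqrt{k/n}$ and therefore the specialized version of Proposition 6.1 reads:

$$\mathbb{P}\Big(\Big\|\frac{1}{n}X^{\sf T}X - \Sigma\Big\|_2 \ge 8\sqrt{\frac{k}{n}}\,\sigma_{\max}\Big) \le 2e^{-k/2}, \tag{38}$$

$$\mathbb{P}\Big(\Big\|\Big(\frac{1}{n}X^{\sf T}X\Big)^{-1} - \Sigma^{-1}\Big\|_2 \ge 8\sqrt{\frac{k}{n}}\,\sigma_{\min}^{-1}\Big) \le 2e^{-k/2}. \tag{39}$$

We define the event $\mathcal{E}_1$ as

$$\mathcal{E}_1 \equiv \Big\{\big\|\widehat\Sigma^{-1}_{T_0,T_0} - \Sigma^{-1}_{T_0,T_0}\big\|_2 \le 8\sqrt{\frac{t_0}{n}}\,C_{\min}^{-1}\Big\}.$$

Applying Eqs. (38), (39) to $X_{T_0}$, we conclude that

$$\mathbb{P}(\mathcal{E}_1^c) \le 2e^{-t_0/2}. \tag{40}$$

We now have in place all we need to bound the terms $\mathcal{T}_1$ and $\mathcal{T}_2$.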
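The step from Proposition 6.1 to its specialized form can be verified directly: with $t = \sqrt{k/n}$ one has $\tau = 4\sqrt{k/n} + 4k/n$, and $k/n \le \sqrt{k/n}$ whenever $k \le n$. The short check below illustrates this; the function name `tau` is just a label for the quantity $\tau$ of the proposition.

```python
import math

def tau(k, n, t):
    """tau = 2 (sqrt(k/n) + t) + (sqrt(k/n) + t)^2, as in Proposition 6.1."""
    r = math.sqrt(k / n) + t
    return 2.0 * r + r * r

def specialized_bound_holds(k, n):
    """Check tau <= 8 sqrt(k/n) for the choice t = sqrt(k/n), with k <= n."""
    t = math.sqrt(k / n)
    return tau(k, n, t) <= 8.0 * t + 1e-12  # tolerance for rounding
```

The tolerance only guards against floating-point rounding in the boundary case $k = n$, where the inequality holds with equality.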



6.1.1 Bounding $\mathcal{T}_1$

To bound $\mathcal{T}_1$, we employ similar techniques to those used in [Wai09, Theorem 3] to verify strict dual feasibility. The argument in [Wai09] works under the irrepresentability condition (see Eq. (26) therein) and we modify it to apply to the current setting, i.e., the generalized irrepresentability condition.
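We begin by conditioning on $X_{T_0}$. For $j \in T_0^c$, $x_j$ is a zero mean Gaussian vector and we can decompose it into a linear correlated part plus an uncorrelated part as

$$x_j^{\sf T} = \Sigma_{j,T_0}\Sigma^{-1}_{T_0,T_0}X_{T_0}^{\sf T} + \epsilon_j^{\sf T},$$

where $\epsilon_j \in \mathbb{R}^n$ has i.i.d. entries distributed as $\epsilon_{ji} \sim {\sf N}\big(0,\ \Sigma_{jj} - \Sigma_{j,T_0}\Sigma^{-1}_{T_0,T_0}\Sigma_{T_0,j}\big)$.

Letting $u = \widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0}v_{0,T_0}$, we write

$$u_j = x_j^{\sf T}X_{T_0}(X_{T_0}^{\sf T}X_{T_0})^{-1}v_{0,T_0} = \Sigma_{j,T_0}\Sigma^{-1}_{T_0,T_0}v_{0,T_0} + \epsilon_j^{\sf T}X_{T_0}(X_{T_0}^{\sf T}X_{T_0})^{-1}v_{0,T_0}. \tag{41}$$

The first term is bounded as $|\Sigma_{j,T_0}\Sigma^{-1}_{T_0,T_0}v_{0,T_0}| \le 1 - \eta$ as per Eq. (32). Let $m_j = \epsilon_j^{\sf T}X_{T_0}(X_{T_0}^{\sf T}X_{T_0})^{-1}v_{0,T_0}$. Since $\mathrm{Var}(\epsilon_{ji}) \le \Sigma_{jj} \le 1$, conditioned on $X_{T_0}$, $m_j$ is zero mean Gaussian with variance at most

$$\mathrm{Var}(m_j) \le \big\|X_{T_0}(X_{T_0}^{\sf T}X_{T_0})^{-1}v_{0,T_0}\big\|_2^2 = \frac{1}{n}\,v_{0,T_0}^{\sf T}\Big(\frac{X_{T_0}^{\sf T}X_{T_0}}{n}\Big)^{-1}v_{0,T_0} \le \frac{1}{n}\,\big\|\widehat\Sigma^{-1}_{T_0,T_0}\big\|_2\,\|v_{0,T_0}\|_2^2. \tag{42}$$

Under the event $\mathcal{E}_1$, we have

$$\big\|\widehat\Sigma^{-1}_{T_0,T_0}\big\|_2 \le \big\|\Sigma^{-1}_{T_0,T_0}\big\|_2 + \big\|\widehat\Sigma^{-1}_{T_0,T_0} - \Sigma^{-1}_{T_0,T_0}\big\|_2 \le \Big(1 + 8\sqrt{\frac{t_0}{n}}\Big)C_{\min}^{-1} \le 9\,C_{\min}^{-1}, \tag{43}$$

and hence, since $\|v_{0,T_0}\|_2^2 = t_0$, $\mathrm{Var}(m_j) \le 9t_0/(nC_{\min})$. We now define the event $\mathcal{E}$ as

$$\mathcal{E} \equiv \Big\{\max_{j\in T_0^c}|m_j| \ge \sqrt{\frac{18\,c_1 t_0\log p}{nC_{\min}}}\Big\}.$$

By the total probability rule, we have

$$\mathbb{P}(\mathcal{E}) \le \mathbb{P}(\mathcal{E};\mathcal{E}_1) + \mathbb{P}(\mathcal{E}_1^c).$$

Using the Gaussian tail bound and union bounding over $j \in T_0^c$, we obtain $\mathbb{P}(\mathcal{E};\mathcal{E}_1) \le 2p^{1-c_1}$. Using the bound $\mathbb{P}(\mathcal{E}_1^c) \le 2e^{-t_0/2}$, we arrive at:

$$\mathbb{P}\Big(\max_{j\in T_0^c}|m_j| \ge \sqrt{\frac{18\,c_1 t_0\log p}{nC_{\min}}}\Big) \le 2p^{1-c_1} + 2e^{-t_0/2}. \tag{44}$$

Using this, together with Eq. (32), in Eq. (41), we obtain that the following holds true with probability at least $1 - 2p^{1-c_1} - 2e^{-t_0/2}$:

$$\mathcal{T}_1 \le 1 - \eta + \sqrt{\frac{18\,c_1 t_0\log p}{nC_{\min}}}. \tag{45}$$

It is easy to check that this implies $\mathcal{T}_1 \le 1 - \eta/2$, for $\lambda$ as claimed in Eq. (21), provided $n \ge M_1 t_0\log p$.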
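6.1.2 Bounding $\mathcal{T}_2$

We bound $\mathcal{T}_2$ by the same technique used in proving Eq. (28). Let $m = (1/\lambda)\big(\widehat r_{T_0^c} - \widehat\Sigma_{T_0^c,T_0}\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big)$. Plugging for $\widehat r$, we get $m = X_{T_0^c}^{\sf T}\,\Pi_{X_{T_0}^\perp}W/(n\lambda)$. Since $W \sim {\sf N}(0, \sigma^2{\rm I}_{n\times n})$, conditioned on $X$, the variable $m_j = x_j^{\sf T}\,\Pi_{X_{T_0}^\perp}W/(n\lambda)$ is normal with variance at most

$$\Big(\frac{\sigma}{n\lambda}\Big)^2\big\|\Pi_{X_{T_0}^\perp}x_j\big\|_2^2 \le \Big(\frac{\sigma}{n\lambda}\Big)^2\|x_j\|_2^2,$$

where we used the contraction property of orthogonal projections. Now, define the event $\mathcal{E}$ as follows.

$$\mathcal{E} \equiv \big\{\|x_j\|_2^2 \le 2n,\ \forall j \in [p]\big\}.$$

Note that $\|x_j\|_2^2 = \Sigma_{jj}Z$, where $Z$ is a chi-squared random variable with $n$ degrees of freedom. Using the standard chi-squared tail bounds [Joh01], for a fixed $j$, we have $\|x_j\|_2^2 \le 2\Sigma_{jj}n \le 2n$, with probability at least $1 - e^{-n/10}$. Union bounding over $j \in [p]$, we obtain $\mathbb{P}(\mathcal{E}^c) \le p\,e^{-n/10}$.

Under the event $\mathcal{E}$, we have $\mathrm{Var}(m_j) \le 2\sigma^2/(n\lambda^2)$. Employing the standard Gaussian tail bound along with union bounding over $j \in T_0^c$, we obtain

$$\mathbb{P}\big(\mathcal{T}_2 \ge \eta/2;\ \mathcal{E}\big) \le 2p\,e^{-n\lambda^2\eta^2/(16\sigma^2)} = 2p^{1-c_1}. \tag{46}$$

Hence,

$$\mathbb{P}(\mathcal{T}_2 \ge \eta/2) \le \mathbb{P}(\mathcal{T}_2 \ge \eta/2;\ \mathcal{E}) + \mathbb{P}(\mathcal{E}^c) \le 2p^{1-c_1} + p\,e^{-n/10}. \tag{47}$$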
6.2 Proof of Eq. (35) 

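We next prove Eq. (35). Given Eq. (33), we need to show

$$\mathrm{sign}\big(\theta_{0,T_0} - \lambda\,\Sigma^{-1}_{T_0,T_0}v_{0,T_0}\big) = \mathrm{sign}\Big(\theta_{0,T_0} - \widehat\Sigma^{-1}_{T_0,T_0}\big(\lambda v_{0,T_0} - \widehat r_{T_0}\big)\Big).$$

Let $u = \theta_{0,T_0} - \lambda\,\Sigma^{-1}_{T_0,T_0}v_{0,T_0}$, and $\widehat u = \theta_{0,T_0} - \widehat\Sigma^{-1}_{T_0,T_0}(\lambda v_{0,T_0} - \widehat r_{T_0})$.

By condition (iii), we have, for all $i \in S$, $|u_i| \ge |\theta_{0,i}| - \lambda\,|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i| \ge c_2\lambda + (1/2)\lambda\,|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i|$. Further, for all $i \in T_0\setminus S$, we have $|u_i| = \lambda\,|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i| \ge c_2\lambda + (1/2)\lambda\,|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i|$. Summarizing, for all $i \in T_0$, we have

$$|u_i| \ge c_2\lambda + \frac{1}{2}\lambda\,\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i\big|.$$

We will show that $|u_i - \widehat u_i| < c_2\lambda + (1/2)\lambda\,|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i|$ for all $i \in T_0$, with high probability, thus implying $\mathrm{sign}(\widehat u_{T_0}) = \mathrm{sign}(u_{T_0})$ as desired. Since $|u_i - \widehat u_i| \le \lambda\,|[(\widehat\Sigma^{-1}_{T_0,T_0} - \Sigma^{-1}_{T_0,T_0})v_{0,T_0}]_i| + |[\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}]_i|$, it suffices to show that

$$\mathcal{T}_3(i) \equiv \lambda\,\big|[(\widehat\Sigma^{-1}_{T_0,T_0} - \Sigma^{-1}_{T_0,T_0})v_{0,T_0}]_i\big| \le \frac{1}{2}\lambda\,\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i\big| \quad \text{for all } i \in T_0, \tag{48}$$

$$\mathcal{T}_4 \equiv \big\|\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big\|_\infty < c_2\lambda. \tag{49}$$

In the sequel, we provide probabilistic bounds on $\mathcal{T}_3(i)$ and $\mathcal{T}_4$.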
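6.2.1 Bounding $\mathcal{T}_3(i)$

Lemma 6.2. Under the assumptions of Theorem 3.4, for any $c_1 > 1$, $t_0 \ge 4$, we have

$$\mathbb{P}\Big\{\exists\, i \in T_0 \ \text{s.t.}\ \big|[(\widehat\Sigma^{-1}_{T_0,T_0} - \Sigma^{-1}_{T_0,T_0})v_{0,T_0}]_i\big| \ge 16\sqrt{\frac{c_1c_*t_0\log p}{n}}\,\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i\big|\Big\} \le 2e^{-t_0/2} + 2p^{1-c_1},$$

where $c_* \equiv (c_2C_{\min})^{-2}$.

The proof of Lemma 6.2 is presented in Section A.6.

Applying this lemma, with probability at least $1 - 2e^{-t_0/2} - 2p^{1-c_1}$, we have $\mathcal{T}_3(i) \le (1/2)\lambda\,|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i|$ for all $i \in T_0$, provided

$$16\sqrt{\frac{c_1c_*t_0\log p}{n}} \le \frac{1}{2},$$

i.e., for $n \ge M_2\, t_0\log p$.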



6.2.2 Bounding $\mathcal{T}_4$

Lemma 6.3. The following holds true.

$$\mathbb{P}\Big(\mathcal{T}_4 \le 3\sigma\sqrt{\frac{2c_1\log p}{nC_{\min}}}\Big) \ge 1 - 2e^{-t_0/2} - 2p^{1-c_1}. \tag{50}$$

Lemma 6.3 is proved in Section A.7.

From the last lemma, it follows that Eq. (49) holds with probability at least $1 - 2e^{-t_0/2} - 2p^{1-c_1}$, provided

$$3\sigma\sqrt{\frac{2c_1\log p}{nC_{\min}}} < c_2\lambda.$$

Choosing $\lambda$ as per Eq. (21), the latter is easily shown to follow from $\eta \le c_2\sqrt{C_{\min}}$.

6.3 Summary: Proof of Theorem 3.4 

Now combining the bounds on $\mathcal{T}_1, \dots, \mathcal{T}_4$, we get that for $n \ge \max(M_1, M_2)\, t_0\log p$, Eqs. (34) and (35) hold simultaneously, with probability at least $1 - p\,e^{-n/10} - 6e^{-t_0/2} - 8p^{1-c_1}$. This implies $\mathrm{sign}(\widehat\theta^n(\lambda)) = v_0$.



6.4 Proof of Theorem 3.7 

Note that the matrix $X_{T_0}$ is a random Gaussian matrix with rows drawn independently from ${\sf N}(0, \Sigma_{T_0,T_0})$ (recall that $T_0$ is a deterministic set determined by the population-level problem). Therefore, under the event $\mathcal{E}_1$, $\|\widehat\Sigma^{-1}_{T_0,T_0}\|_2 \le 9\,\|\Sigma^{-1}_{T_0,T_0}\|_2 \le 9\,C_{\min}^{-1}$. Using Theorem 3.4 to bound the probability that $T \ne T_0$, the proof proceeds along the same lines as the proof of Theorem 2.7.
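A Proof of technical lemmas

A.1 Proof of Lemma 2.1

By a change of variables, it is easy to see that $\widehat\theta^{\rm ZN}(\xi) = \theta_0 + \xi\,\widehat u(\xi)$, where $\widehat u(\xi) \equiv \arg\min_{u\in\mathbb{R}^p}F(u;\xi)$ and

$$F(u;\xi) \equiv \frac{1}{2}\langle u, \widehat\Sigma u\rangle + \|u_{S^c}\|_1 + \big(\|\xi^{-1}\theta_{0,S} + u_S\|_1 - \|\xi^{-1}\theta_{0,S}\|_1\big).$$

The rest of the proof is analogous to an argument in [BRT09]. Since, by definition, $F(\widehat u;\xi) \le F(0;\xi)$, we have

$$\frac{1}{2}\langle\widehat u, \widehat\Sigma\widehat u\rangle + \|\widehat u_{S^c}\|_1 - \|\widehat u_S\|_1 \le 0, \tag{51}$$

and hence $\|\widehat u_{S^c}\|_1 \le \|\widehat u_S\|_1$. Using the definition of $\kappa$, with $J = S$, $c_0 = 1$, we have

$$0 \ge \frac{1}{2}\kappa(s_0,1)\|\widehat u\|_2^2 + \|\widehat u_{S^c}\|_1 - \|\widehat u_S\|_1 \ge \frac{1}{2}\kappa(s_0,1)\|\widehat u_S\|_2^2 - \|\widehat u_S\|_1,$$

and since $\|\widehat u_S\|_2^2 \ge \|\widehat u_S\|_1^2/s_0$, we deduce that

$$\|\widehat u_S\|_1 \le \frac{2s_0}{\kappa(s_0,1)}.$$

By Eq. (51), this implies in turn

$$\langle\widehat u, \widehat\Sigma\widehat u\rangle \le 2\|\widehat u_S\|_1 \le \frac{4s_0}{\kappa(s_0,1)}. \tag{52}$$

Now, consider the stationarity conditions of $F$. These imply

$$(\widehat\Sigma\widehat u)_i = -\mathrm{sign}(\widehat u_i), \quad \text{for } i \in T\setminus S.$$

We therefore have

$$|T\setminus S| \le \sum_{i\in T\setminus S}(\widehat\Sigma\widehat u)_i^2 \le \|\widehat\Sigma\widehat u\|_2^2 \le \|\widehat\Sigma\|_2\,\langle\widehat u, \widehat\Sigma\widehat u\rangle,$$

and our claim follows by substituting Eq. (52) in the latter equation.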
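A.2 Proof of Lemma 2.2

By a change of variables, it is easy to see that $\widehat\theta^{\rm ZN}(\xi) = \theta_0 + \xi\,\widehat u(\xi)$, where $\widehat u(\xi) \equiv \arg\min_{u\in\mathbb{R}^p}F(u;\xi)$ and

$$F(u;\xi) \equiv \frac{1}{2}\langle u, \widehat\Sigma u\rangle + \|u_{S^c}\|_1 + \big(\|\xi^{-1}\theta_{0,S} + u_S\|_1 - \|\xi^{-1}\theta_{0,S}\|_1\big).$$

Notice that, for any $u \in \mathbb{R}^p$, $\lim_{\xi\to 0}F(u;\xi) = F_0(u)$, where

$$F_0(u) \equiv \frac{1}{2}\langle u, \widehat\Sigma u\rangle + \|u_{S^c}\|_1 + \langle\mathrm{sign}(\theta_{0,S}), u_S\rangle.$$

Indeed $F(u;\xi) = F_0(u)$ provided $\xi \le \min_{i\in S}|\theta_{0,i}/u_i|$. Further, $F(u;\xi) \ge F_0(u)$ for all $u$.

Let $\widehat u_0 \equiv \arg\min_{u\in\mathbb{R}^p}F_0(u)$, and set $\xi_0 \equiv \min_{i\in S}|\theta_{0,i}/\widehat u_{0,i}|$. Then, for any $u \ne \widehat u_0$, and all $\xi \in (0,\xi_0)$, we have

$$F(u;\xi) \ge F_0(u) > F_0(\widehat u_0) = F(\widehat u_0;\xi).$$

Hence $\widehat u_0$ is the unique minimizer of $F(u;\xi)$, i.e., $\widehat u(\xi) = \widehat u_0$ for all $\xi \in (0,\xi_0)$.

It follows that $\widehat\theta^{\rm ZN}(\xi) = \theta_0 + \xi\,\widehat u_0$ for all $\xi \in (0,\xi_0)$ and hence $\mathrm{sign}(\widehat\theta^{\rm ZN}(\xi)) = v_0$ and $\mathrm{supp}(\widehat\theta^{\rm ZN}(\xi)) = T_0$, where we set

$$v_{0,S} \equiv \mathrm{sign}(\theta_{0,S}), \qquad v_{0,S^c} \equiv \mathrm{sign}(\widehat u_{0,S^c}), \qquad T_0 \equiv S \cup \mathrm{supp}(\widehat u_0).$$

Finally, the zero subgradient condition for $\widehat u_0$ reads $\widehat\Sigma\widehat u_0 + z = 0$, with $z_S = \mathrm{sign}(\theta_{0,S})$ and $z_{S^c} \in \partial\|\widehat u_{0,S^c}\|_1$. In particular, $z_{T_0} = v_{0,T_0}$ and therefore $\widehat u_{0,T_0} = -\widehat\Sigma^{-1}_{T_0,T_0}v_{0,T_0}$. This implies

$$\xi_0 = \min_{i\in S}\left|\frac{\theta_{0,i}}{[\widehat\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i}\right|.$$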



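A.3 Proof of Lemma 2.3

Writing the zero-subgradient conditions for problem (5), we have

$$\widehat\Sigma(\widehat\theta^{\rm ZN} - \theta_0) = -\xi u, \qquad u \in \partial\|\widehat\theta^{\rm ZN}\|_1.$$

Given that $T \supseteq S$, we have $\theta_{0,T^c} = 0$, and thus

$$\widehat\Sigma_{T,T}(\widehat\theta^{\rm ZN}_T - \theta_{0,T}) = -\xi u_T, \qquad \widehat\Sigma_{T^c,T}(\widehat\theta^{\rm ZN}_T - \theta_{0,T}) = -\xi u_{T^c}.$$

Solving for $\widehat\theta^{\rm ZN}_T - \theta_{0,T}$ in terms of $u_T$, we obtain

$$\widehat\Sigma_{T^c,T}\widehat\Sigma^{-1}_{T,T}u_T = u_{T^c}, \qquad \widehat\theta^{\rm ZN}_T = \theta_{0,T} - \xi\,\widehat\Sigma^{-1}_{T,T}u_T.$$

This proves the 'only if' part, noting that $u_T = \mathrm{sign}(\widehat\theta^{\rm ZN}_T) = v_T$, and $\|u_{T^c}\|_\infty \le 1$ since $u \in \partial\|\widehat\theta^{\rm ZN}\|_1$.

Now suppose that Eqs. (8) and (9) hold true. Let $\theta_T = \theta_{0,T} - \xi\,\widehat\Sigma^{-1}_{T,T}v_T$ and $\theta_{T^c} = 0$. We prove that $\theta = \widehat\theta^{\rm ZN}$, by showing that it satisfies the zero-subgradient condition. By Eq. (9), $v_T = \mathrm{sign}(\theta_T)$. Define $u \in \mathbb{R}^p$ by letting $u_T = v_T$ and $u_{T^c} = \widehat\Sigma_{T^c,T}\widehat\Sigma^{-1}_{T,T}v_T$. Note that $\|u_{T^c}\|_\infty \le 1$ by Eq. (8), and so $u \in \partial\|\theta\|_1$. Moreover,

$$\widehat\Sigma_{T,T}(\theta_T - \theta_{0,T}) = -\xi u_T, \qquad \widehat\Sigma_{T^c,T}(\theta_T - \theta_{0,T}) = -\xi u_{T^c}.$$

Combining the above two equations, we get the zero-subgradient condition for $(\theta, u)$. Therefore, $\theta = \widehat\theta^{\rm ZN}$, and $v = \mathrm{sign}(\widehat\theta^{\rm ZN})$.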

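A.4 Proof of Lemma 2.4

The proof proceeds along the same lines as the proof of Lemma 2.3. We begin with proving the 'only if' part. The zero-subgradient condition for problem (3) reads:

$$-\frac{1}{n}X^{\sf T}(Y - X\widehat\theta^n) + \lambda u = 0, \qquad u \in \partial\|\widehat\theta^n\|_1.$$

Plugging for $Y = X\theta_0 + W$ and $\widehat r = (X^{\sf T}W/n)$ in the above equation, we arrive at:

$$\widehat\Sigma(\widehat\theta^n - \theta_0) = \widehat r - \lambda u.$$

Since $T \supseteq S$, $\theta_{0,T^c} = 0$, and writing the above equation for indices in $T$ and $T^c$ separately, we obtain

$$\widehat\Sigma_{T^c,T}(\widehat\theta^n_T - \theta_{0,T}) = \widehat r_{T^c} - \lambda u_{T^c}, \qquad \widehat\Sigma_{T,T}(\widehat\theta^n_T - \theta_{0,T}) = \widehat r_T - \lambda u_T.$$

Solving for $\widehat\theta^n_T - \theta_{0,T}$ from the second equation, we get

$$\widehat\theta^n_T = \theta_{0,T} - \widehat\Sigma^{-1}_{T,T}(\lambda u_T - \widehat r_T).$$

This proves Eqs. (11) and (12), since $u_T = \mathrm{sign}(\widehat\theta^n_T) = z_T$ and $\|u_{T^c}\|_\infty \le 1$.

We next prove the other direction. Suppose that Eqs. (11) and (12) hold true. Let $\theta_T = \theta_{0,T} - \widehat\Sigma^{-1}_{T,T}(\lambda z_T - \widehat r_T)$, and $\theta_{T^c} = 0$. We prove that $\theta = \widehat\theta^n$, by showing that it satisfies the zero-subgradient condition. By Eq. (12), $z_T = \mathrm{sign}(\theta_T)$. Define $u \in \mathbb{R}^p$ by letting $u_T = z_T$ and $u_{T^c} = \widehat\Sigma_{T^c,T}\widehat\Sigma^{-1}_{T,T}z_T + (\widehat r_{T^c} - \widehat\Sigma_{T^c,T}\widehat\Sigma^{-1}_{T,T}\widehat r_T)/\lambda$. Note that $\|u_{T^c}\|_\infty \le 1$ by Eq. (11), and so $u \in \partial\|\theta\|_1$. Moreover,

$$\widehat\Sigma_{T,T}(\theta_T - \theta_{0,T}) = -(\lambda u_T - \widehat r_T), \qquad \widehat\Sigma_{T^c,T}(\theta_T - \theta_{0,T}) = -(\lambda u_{T^c} - \widehat r_{T^c}).$$

Combining the above two equations, we get the zero-subgradient condition for $(\theta, u)$. Therefore, $\theta = \widehat\theta^n$, and $z = \mathrm{sign}(\widehat\theta^n)$.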
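A.5 Proof of Lemma 5.1

Let $m = \widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0} = (X_{T_0}^{\sf T}X_{T_0})^{-1}X_{T_0}^{\sf T}W$. Conditioned on $X_{T_0}$, $m_i$ is a zero mean Gaussian random variable with variance $\sigma^2\|e_i^{\sf T}(X_{T_0}^{\sf T}X_{T_0})^{-1}X_{T_0}^{\sf T}\|_2^2$. By a Gaussian tail bound, we get

$$\mathbb{P}\Big(|m_i| \ge \sigma\sqrt{2c_1\log p}\;\big\|e_i^{\sf T}(X_{T_0}^{\sf T}X_{T_0})^{-1}X_{T_0}^{\sf T}\big\|_2\Big) \le 2p^{-c_1}.$$

Further, notice that $\|e_i^{\sf T}(X_{T_0}^{\sf T}X_{T_0})^{-1}X_{T_0}^{\sf T}\|_2^2 \le \|\widehat\Sigma^{-1}_{T_0,T_0}\|_2/n$. By union bounding over $i = 1, \dots, t_0$, we have

$$\mathbb{P}\Big(\|m\|_\infty \ge \sigma\sqrt{\frac{2c_1\log p}{n}}\;\big\|\widehat\Sigma^{-1}_{T_0,T_0}\big\|_2^{1/2}\Big) \le 2p^{1-c_1}.$$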

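A.6 Proof of Lemma 6.2

We begin by stating and proving a lemma that is similar to Lemma 5 in [Wai09], but provides a stronger control.

Lemma A.1. Let $Z \in \mathbb{R}^{n\times k}$ be a random matrix with i.i.d. Gaussian rows with zero mean and covariance $\Sigma$, with $k \ge 4$. Further let $a_1, \dots, a_M \in \mathbb{R}^k$ and $b_1, \dots, b_M \in \mathbb{R}^k$ be non-random vectors. Then, letting $\widehat\Sigma_Z \equiv Z^{\sf T}Z/n$, we have, for all $\Delta > 0$:

$$\mathbb{P}\Big\{\exists\, i\in[M]\ \text{s.t.}\ \big|\langle a_i, (\widehat\Sigma_Z^{-1} - \Sigma^{-1})b_i\rangle\big| \ge 8\sqrt{\frac{k}{n}}\,\big|\langle a_i, \Sigma^{-1}b_i\rangle\big| + \Delta\,\|\Sigma^{-1/2}a_i\|_2\|\Sigma^{-1/2}b_i\|_2\Big\} \le 2e^{-k/2} + 2M\exp\Big\{-\frac{n\Delta^2}{256}\Big\}. \tag{53}$$

Proof. First notice that $Z = \widetilde Z\,\Sigma^{1/2}$ with $\widetilde Z \in \mathbb{R}^{n\times k}$ a random matrix with i.i.d. standard Gaussian entries $\widetilde Z_{ij} \sim {\sf N}(0,1)$. By substituting in the statement of the lemma, it is easy to check that we only need to prove our claim in the case $\Sigma = {\rm I}_{k\times k}$ (i.e., for $Z$ with i.i.d. entries), which we shall assume hereafter.

Defining the event $\mathcal{E}_* \equiv \{\|\widehat\Sigma_Z^{-1} - {\rm I}\|_2 \le 8\sqrt{k/n}\}$, we have, by Eq. (39) and the union bound,

$$\mathbb{P}\Big\{\exists\, i\in[M]\ \text{s.t.}\ \big|\langle a_i, (\widehat\Sigma_Z^{-1} - {\rm I})b_i\rangle\big| \ge 8\sqrt{\frac{k}{n}}\,|\langle a_i, b_i\rangle| + \Delta\,\|a_i\|_2\|b_i\|_2\Big\}$$
$$\le 2e^{-k/2} + M\max_{i\in[M]}\mathbb{P}\Big\{\big|\langle a_i, (\widehat\Sigma_Z^{-1} - {\rm I})b_i\rangle\big| \ge 8\sqrt{\frac{k}{n}}\,|\langle a_i, b_i\rangle| + \Delta\,\|a_i\|_2\|b_i\|_2;\ \mathcal{E}_*\Big\}.$$

We can now concentrate on the last probability. Let $\alpha = |\langle a_i, b_i\rangle|$ and $\beta = \big(\|a_i\|_2^2\|b_i\|_2^2 - \langle a_i, b_i\rangle^2\big)^{1/2}$. Since $\widehat\Sigma_Z$ is distributed as $R\,\widehat\Sigma_Z R^{\sf T}$ for any orthogonal matrix $R$, we have

$$\langle a_i, (\widehat\Sigma_Z^{-1} - {\rm I})b_i\rangle \stackrel{\rm d}{=} \alpha\,\langle e_1, (\widehat\Sigma_Z^{-1} - {\rm I})e_1\rangle + \beta\,\langle e_1, (\widehat\Sigma_Z^{-1} - {\rm I})e_2\rangle,$$

where $\stackrel{\rm d}{=}$ denotes equality in distribution. Under the event $\mathcal{E}_*$, we have $|\alpha\,\langle e_1, (\widehat\Sigma_Z^{-1} - {\rm I})e_1\rangle| \le 8\alpha\sqrt{k/n}$. Further $(\widehat\Sigma_Z^{-1} - {\rm I}) = UDU^{\sf T}$ with $U$ a uniformly random orthogonal matrix (with respect to Haar measure on the manifold of orthogonal matrices). Letting $u_1$, $u_2$ denote the first two rows of $U$ we then have

$$\mathbb{P}\Big\{\big|\langle a_i, (\widehat\Sigma_Z^{-1} - {\rm I})b_i\rangle\big| \ge 8\sqrt{\frac{k}{n}}\,|\langle a_i, b_i\rangle| + \Delta\,\|a_i\|_2\|b_i\|_2;\ \mathcal{E}_*\Big\} \le \mathbb{P}\big\{|\langle u_1, Du_2\rangle| \ge \Delta;\ \mathcal{E}_*\big\}.$$

Notice that conditioned on $u_2$ and $D$, $u_1$ is uniformly random on a $(k-1)$-dimensional sphere. Further, letting $v_2 = Du_2$, we have $\|v_2\|_2 \le 8\sqrt{k/n}$. Hence, by isoperimetric inequalities on the sphere [Led01], we obtain

$$\mathbb{P}\big\{|\langle u_1, Du_2\rangle| \ge \Delta;\ \mathcal{E}_*\big\} \le \sup_{\|v_2\|_2\le 8\sqrt{k/n}}\mathbb{P}\big\{|\langle u_1, v_2\rangle| \ge \Delta\ \big|\ v_2\big\} \le 2\exp\Big\{-\frac{(k-2)\Delta^2}{128\,k/n}\Big\} \le 2\exp\Big\{-\frac{n\Delta^2}{256}\Big\},$$

where the last inequality holds for all $k \ge 4$. The proof is completed by substituting this inequality in the expressions above. □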

We are now in position to prove Lemma 6.2. 

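Proof (Lemma 6.2). We apply Lemma A.1 to $\Sigma = \Sigma_{T_0,T_0}$, $M = t_0$, $a_i = e_i$ and $b_i = v_{0,T_0}$ for $i \in \{1, \dots, t_0\}$. We get

$$\mathbb{P}\Big\{\exists\, i\in T_0\ \text{s.t.}\ \big|[(\widehat\Sigma^{-1}_{T_0,T_0} - \Sigma^{-1}_{T_0,T_0})v_{0,T_0}]_i\big| \ge 8\sqrt{\frac{t_0}{n}}\,\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i\big| + \Delta\,\|\Sigma^{-1/2}_{T_0,T_0}e_i\|_2\|\Sigma^{-1/2}_{T_0,T_0}v_{0,T_0}\|_2\Big\} \le 2e^{-t_0/2} + 2t_0\exp\Big\{-\frac{n\Delta^2}{256}\Big\}.$$

Note that $\|\Sigma^{-1/2}_{T_0,T_0}e_i\|_2\,\|\Sigma^{-1/2}_{T_0,T_0}v_{0,T_0}\|_2 \le C_{\min}^{-1}\|e_i\|_2\|v_{0,T_0}\|_2 = C_{\min}^{-1}\sqrt{t_0}$. Further $|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i| \ge 2c_2$, and hence $\|\Sigma^{-1/2}_{T_0,T_0}e_i\|_2\,\|\Sigma^{-1/2}_{T_0,T_0}v_{0,T_0}\|_2 \le (1/2)\sqrt{c_*t_0}\;|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i|$. We therefore get

$$\mathbb{P}\Big\{\exists\, i\in T_0\ \text{s.t.}\ \big|[(\widehat\Sigma^{-1}_{T_0,T_0} - \Sigma^{-1}_{T_0,T_0})v_{0,T_0}]_i\big| \ge \Big(8\sqrt{\frac{t_0}{n}} + \frac{\Delta}{2}\sqrt{c_*t_0}\Big)\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i\big|\Big\} \le 2e^{-t_0/2} + 2t_0\exp\Big\{-\frac{n\Delta^2}{256}\Big\}.$$

The proof is completed by taking $\Delta = 16\sqrt{(c_1\log p)/n}$. □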

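A.7 Proof of Lemma 6.3

By Lemma 5.1, we have

$$\mathbb{P}\Big(\big\|\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big\|_\infty \ge \sigma\sqrt{\frac{2c_1\log p}{n}}\;\big\|\widehat\Sigma^{-1}_{T_0,T_0}\big\|_2^{1/2}\Big) \le 2p^{1-c_1}.$$

Recalling Eq. (43), under the event $\mathcal{E}_1$ we have $\|\widehat\Sigma^{-1}_{T_0,T_0}\|_2 \le 9\,C_{\min}^{-1}$. Since $\mathbb{P}(\mathcal{E}_1^c) \le 2e^{-t_0/2}$, we arrive at:

$$\mathbb{P}\Big(\big\|\widehat\Sigma^{-1}_{T_0,T_0}\widehat r_{T_0}\big\|_\infty \ge 3\sigma\sqrt{\frac{2c_1\log p}{nC_{\min}}}\Big) \le 2p^{1-c_1} + 2e^{-t_0/2}.$$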
B Generalized irrepresentability vs. irrepresentability

In this appendix we discuss the example provided in Section 1.1 in more detail. The objective is to develop some intuition on the domain of validity of generalized irrepresentability, and compare it with the standard irrepresentability condition.

As explained in Section 1.1, let $S = \mathrm{supp}(\theta_0) = \{1, \dots, s_0\}$ and consider the following covariance matrix:

$$\Sigma_{ij} = \begin{cases} 1 & \text{if } i = j,\\ a & \text{if } i = p,\ j\in S \text{ or } i\in S,\ j = p,\\ 0 & \text{otherwise.}\end{cases}$$

Equivalently,

$$\Sigma = {\rm I}_{p\times p} + a\,(e_pu_S^{\sf T} + u_Se_p^{\sf T}),$$

where $u_S$ is the vector with entries $(u_S)_i = 1$ for $i \in S$ and $(u_S)_i = 0$ for $i \notin S$. It is easy to check that $\Sigma$ is strictly positive definite for $a \in (-1/\sqrt{s_0}, +1/\sqrt{s_0})$. By redefining the $p$-th covariate, we can assume, without loss of generality, $a \in [0, +1/\sqrt{s_0})$. We will further assume $\mathrm{sign}(\theta_{0,i}) = +1$ for all $i \in S$.
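As a quick numerical illustration of this construction, the sketch below builds $\Sigma$ for given $p$, $s_0$ and $a$, and evaluates its smallest eigenvalue together with the standard irrepresentability quantity $\|\Sigma_{S^c,S}\Sigma_{S,S}^{-1}\|_\infty$ (maximum absolute row sum). The function names are hypothetical, and the code uses zero-based indices, so the support is $\{0, \dots, s_0 - 1\}$ and the confounding covariate is index $p - 1$; it assumes $s_0 < p$.

```python
import numpy as np

def confounder_cov(p, s0, a):
    """Identity covariance plus correlation a between the last covariate
    and each of the first s0 covariates (requires s0 < p)."""
    Sigma = np.eye(p)
    Sigma[p - 1, :s0] = a
    Sigma[:s0, p - 1] = a
    return Sigma

def min_eigenvalue(Sigma):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(Sigma).min())

def irrepresentability_norm(Sigma, s0):
    """Max absolute row sum of Sigma_{S^c,S} Sigma_{S,S}^{-1}, S = {0..s0-1}."""
    p = Sigma.shape[0]
    S = np.arange(s0)
    Sc = np.arange(s0, p)
    M = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
    return float(np.abs(M).sum(axis=1).max())
```

For example, with $p = 10$, $s_0 = 4$ and $a = 0.3$, the smallest eigenvalue is $1 - a\sqrt{s_0} = 0.4$ while the irrepresentability quantity is $a s_0 = 1.2 > 1$: the covariance is positive definite, yet standard irrepresentability fails.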

This example captures the case of a single confounding variable, i.e., of an irrelevant covariate 
that correlates strongly with the relevant covariates, and with the response variable. 

We will show that the Gauss-Lasso has a significantly broader domain of validity with respect to 
the simple Lasso. 

Claim B.1. Consider the Gaussian design defined above, and suppose that $a > 1/s_0$. Then for any regularization parameter $\lambda$ and for any sample size $n$, the probability of correct signed support recovery with Lasso is at most 1/2 (and is not guaranteed with high probability unless $a \in [0, (1-\eta)/s_0]$, for some constant $\eta > 0$).

On the other hand, Theorem 3.7 implies correct support recovery with the Gauss-Lasso from $n = O(s_0\log p)$ samples, for any

$$a \in \Big[0, \frac{1-\eta}{s_0}\Big] \cup \Big(\frac{1}{s_0}, \frac{1-\eta}{\sqrt{s_0}}\Big]. \tag{54}$$

Proof. In order to prove that Gauss-Lasso correctly recovers the support of $\theta_0$, we will show that all the conditions of Theorem 3.4 and Theorem 3.7 hold with constants of order one, provided Eq. (54) holds. Vice versa, the irrepresentability condition does not hold unless $a \in [0, 1/s_0)$, and hence the simple Lasso fails outside this regime.

We now proceed to check the assumptions of Theorems 3.4 and 3.7, while showing that irrepresentability does not hold for $a \ge 1/s_0$.

Restricted eigenvalues. We have $\lambda_{\min}(\Sigma) = 1 - a\sqrt{s_0}$. In particular, for any set $T \subseteq [p]$, we have $\lambda_{\min}(\Sigma_{T,T}) \ge 1 - a\sqrt{s_0} \ge \eta$. Also, for any constant $c_0 \ge 0$, $\kappa(s_0, c_0) \ge 1 - a\sqrt{s_0} \ge \eta$.

Irrepresentability condition. We have $\Sigma_{S,S} = {\rm I}_{s_0\times s_0}$ and hence

$$\big\|\Sigma_{S^c,S}\Sigma^{-1}_{S,S}\big\|_\infty = \|\Sigma_{p,S}\|_1 = a s_0.$$

Hence the irrepresentability condition holds only if $a \in [0, 1/s_0)$. The corresponding irrepresentability parameter is $\eta = 1 - a s_0$. For large $s_0$, the condition is only satisfied for a small interval in $a$, compared to the interval for which $\Sigma$ is positive definite.

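Generalized irrepresentability condition. In order to check this condition, we need to compute $T_0$ and $v_0$ defined as per Lemma 3.2. We have $\widehat\theta^\infty(\xi) = \arg\min_{\theta\in\mathbb{R}^p}G(\theta;\xi)$ where

$$G(\theta;\xi) \equiv \frac{1}{2}\langle(\theta - \theta_0), \Sigma(\theta - \theta_0)\rangle + \xi\|\theta\|_1 = \frac{1}{2}\|\theta - \theta_0\|_2^2 + a\,\langle u_S, (\theta_S - \theta_{0,S})\rangle\,\theta_p + \xi\|\theta\|_1.$$

From this expression, it is immediate to see that $\widehat\theta^\infty_i = 0$ for $i \notin S\cup\{p\}$. Further $\widehat\theta^\infty_{S\cup\{p\}}(\xi)$ satisfies

$$\theta_S - \theta_{0,S} + a\,\theta_p u_S + \xi v_S = 0, \tag{55}$$

$$\theta_p + a\,\langle u_S, (\theta_S - \theta_{0,S})\rangle + \xi v_p = 0, \tag{56}$$

with $v_S \in \partial\|\theta_S\|_1$ and $v_p \in \partial|\theta_p|$. Since $\theta_{0,S} > 0$, we have, from Eq. (55),

$$\widehat\theta^\infty_S = \theta_{0,S} - (a\,\widehat\theta^\infty_p + \xi)\,u_S,$$

provided $(a\,\widehat\theta^\infty_p + \xi) < \theta_{\min}$. Substituting in Eq. (56) and solving for $\theta_p$, we get

$$\widehat\theta^\infty_p = \begin{cases} 0 & \text{if } a\in[0, 1/s_0),\\[2pt] \dfrac{a s_0 - 1}{1 - a^2s_0}\,\xi & \text{if } a\in[1/s_0, 1/\sqrt{s_0}).\end{cases}$$

This holds provided $(a\,\widehat\theta^\infty_p + \xi) < \theta_{\min}$, i.e., if $\xi < \xi_* \equiv \min\big(1, (1 - a^2s_0)/(1 - a)\big)\,\theta_{\min}$. Using the definition in Lemma 3.2, we have

$$T_0 = \begin{cases} S & \text{if } a\in[0, 1/s_0),\\ S\cup\{p\} & \text{if } a\in[1/s_0, 1/\sqrt{s_0}),\end{cases}$$

and all entries of $v_{0,T_0}$ equal to $+1$.

We can now check the generalized irrepresentability condition. For $a \in [0, 1/s_0)$ we have $\|\Sigma_{T_0^c,T_0}\Sigma^{-1}_{T_0,T_0}v_{0,T_0}\|_\infty = \|\Sigma_{S^c,S}\Sigma^{-1}_{S,S}v_{0,S}\|_\infty = a s_0$, and therefore the generalized irrepresentability condition is satisfied with parameter $\eta = 1 - a s_0$. For $a \in [1/s_0, 1/\sqrt{s_0})$, we have $\|\Sigma_{T_0^c,T_0}\Sigma^{-1}_{T_0,T_0}v_{0,T_0}\|_\infty = 0$.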

We therefore conclude that, for any fixed $\eta \in (0, 1]$, the generalized irrepresentability condition with parameter $\eta$ is satisfied for

$$a \in \Big[0, \frac{1-\eta}{s_0}\Big] \cup \Big[\frac{1}{s_0}, \frac{1}{\sqrt{s_0}}\Big),$$

a significantly larger domain than for simple irrepresentability.

Minimum entry condition. For $a \in [0, 1/s_0)$, we have $T_0 = S$ and it is therefore only necessary to check Eq. (22). Since $[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i = 1$, this reads

$$|\theta_{0,i}| \ge \Big(c_2 + \frac{3}{2}\Big)\lambda = C\sigma\sqrt{\frac{\log p}{n}},$$

with $C$ a constant.

For $a \in (1/s_0, (1-\eta)/\sqrt{s_0}]$, we have $T_0 = S\cup\{p\}$. A straightforward calculation shows that

$$\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i\big| = \frac{1 - a}{1 - a^2s_0}, \quad \text{for } i\in S,$$

$$\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_p\big| = \frac{a s_0 - 1}{1 - a^2s_0}.$$

It is not hard to show that, for all $a$ satisfying Eq. (54), we have

$$\big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_i\big| \le \frac{1}{1 - (1-\eta)^2} \quad \text{for } i\in S, \qquad \big|[\Sigma^{-1}_{T_0,T_0}v_{0,T_0}]_p\big| \ge C,$$

for some constant $C > 0$. It therefore follows that condition (22) holds if $|\theta_{0,i}| \ge C'\sigma\sqrt{\log p/n}$ and condition (23) holds for $c_2 = C/2$. □
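The closed-form expressions used in this proof for the regime $a \in (1/s_0, 1/\sqrt{s_0})$ can be cross-checked numerically. The following sketch is an illustration only: the function name is hypothetical, indices are zero-based (support $\{0, \dots, s_0-1\}$, confounder at index $p-1$), and it assumes $T_0 = S \cup \{p\}$ with $v_{0,T_0}$ the all-ones vector, as derived above.

```python
import numpy as np

def gic_quantities(p, s0, a):
    """Return w = Sigma_{T0,T0}^{-1} v_{0,T0} and the generalized
    irrepresentability quantity ||Sigma_{T0^c,T0} w||_inf for
    T0 = S U {p} and v_{0,T0} = all ones (requires s0 < p)."""
    Sigma = np.eye(p)
    Sigma[p - 1, :s0] = a
    Sigma[:s0, p - 1] = a
    T0 = np.r_[np.arange(s0), p - 1]
    T0c = np.setdiff1d(np.arange(p), T0)
    v0 = np.ones(T0.size)
    w = np.linalg.solve(Sigma[np.ix_(T0, T0)], v0)
    # Covariates outside T0 are uncorrelated with those in T0.
    gic = float(np.abs(Sigma[np.ix_(T0c, T0)] @ w).max()) if T0c.size else 0.0
    return w, gic
```

With $p = 10$, $s_0 = 4$ and $a = 0.4$, the entries of `w` on $S$ equal $(1-a)/(1-a^2s_0) = 5/3$, the entry for the confounder equals $-(a s_0 - 1)/(1 - a^2 s_0) = -5/3$, and the generalized irrepresentability quantity is zero, in agreement with the formulas in the proof.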
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